Abstract

We introduce a new partial order on the class of stochastically monotone Markov kernels
having a given stationary distribution $\pi$ on a given finite partially ordered state space $\mathcal{X}$.
When $K \preceq L$ in this partial order we say that $K$ and $L$ satisfy a comparison inequality.
We establish that if $K_1, \dots, K_t$ and $L_1, \dots, L_t$ are reversible and $K_s \preceq L_s$ for $s = 1, \dots, t$,
then $K_1 \cdots K_t \preceq L_1 \cdots L_t$. In particular, in the time-homogeneous case we have $K^t \preceq L^t$
for every $t$ if $K$ and $L$ are reversible and $K \preceq L$, and using this we show that (for suitable
common initial distributions) the Markov chain $Y$ with kernel $K$ mixes faster than the
chain $Z$ with kernel $L$, in the strong sense that at every time $t$ the discrepancy (measured
by total variation distance or separation or $L^2$-distance) between the law of $Y_t$ and $\pi$ is
smaller than that between the law of $Z_t$ and $\pi$.

Using comparison inequalities together with specialized arguments to remove the
stochastic monotonicity restriction, we answer a question of Persi Diaconis by showing that,
among all symmetric birth-and-death kernels on the path $\mathcal{X} = \{0, \dots, n\}$, the one (we call
it the uniform chain) that produces fastest convergence from initial state 0 to the uniform
distribution has transition probability 1/2 in each direction along each edge of the path,
with holding probability 1/2 at each endpoint.

We also use comparison inequalities

(i) to identify, when $\pi$ is a given log-concave distribution on the path, the
fastest-mixing stochastically monotone birth-and-death chain started at 0, and

(ii) to recover and extend a result of Peres and Winkler that extra updates do not
delay mixing for monotone spin systems.

Among the fastest-mixing chains in (i), we show that the chain for uniform $\pi$ is slowest
in the sense of maximizing separation at every time.

1. Introduction and summary 

A series of papers [6, 32, 5, 4] by Boyd, Diaconis, Xiao, and coauthors considers
the following "fastest-mixing Markov chain" problem. A finite graph $G = (V, E)$
is given, together with a probability distribution $\pi$ on $V$ such that $\pi(i) > 0$ for
every $i$; the goal is to find the fastest-mixing reversible Markov chain (FMMC)
with stationary distribution $\pi$ and transitions allowed only along the edges in $E$.
This is a very important problem because of the use of Markov chains in Markov
chain Monte Carlo (MCMC), where the goal is to sample (at least approximately)
from $\pi$ and the Markov chain is constructed only to facilitate generation of such
observations as efficiently as possible. As their criterion for FMMC, the authors
minimize SLEM (second-largest eigenvalue in modulus, sometimes also called the
absolute value of the "largest small eigenvalue", defined as the absolute value of the
eigenvalue of the one-step kernel with largest absolute value strictly less than 1),
and they find the FMMC using semidefinite programming. (More precisely, [6, 5, 4]
do this; [32] similarly deals with continuous-time chains and minimizes relaxation
time. See these papers for further references; in particular, related work is found

While most of the results in the series are numerical, both [5] and [3] contain
analytical results. For the problem treated in [5] (which, as explained there, has
an application to load balancing for a network of processors [10]), the graph $G$ is
a path (say, on $V = \{0, \dots, n\}$, with an edge joining each consecutively-numbered
pair of vertices) with a self-loop at each vertex, $\pi$ is the uniform distribution, and
it is proved that the FMMC has transition probability $p(i, i+1) = p(i+1, i) = 1/2$
along each edge and $p(i, i) = 0$ except that $p(0, 0) = 1/2 = p(n, n)$. [We will call
this the uniform chain $U = (U_t)_{t = 0, 1, \dots}$.]

The mixing time of a Markov chain can indeed be bounded using the SLEM,
which provides the asymptotic exponential rate of convergence to stationarity. (See,
e.g., [?] for background and standard Markov chain terminology used in this
paper.) But the SLEM provides only a surrogate for true measures of discrepancy
from stationarity, such as the standard total variation (TV) distance, separation
(sep), and $L^2$-distance. For the path problem, for example, Diaconis [personal
communication] has wondered whether the uniform chain might in fact minimize
such distances after any given number of steps (when, for definiteness, all chains
considered must start at 0). In this paper we show that this is indeed the case:
The uniform chain is truly fastest-mixing in a wide variety of senses. Consider any
$t \ge 0$. What we show, precisely, is that, for any birth-and-death chain $X$ having
symmetric transition kernel on the path and initial state 0, the probability mass
function (pmf) $\pi_t$ of $X_t$ majorizes the pmf $\sigma_t$ of $U_t$. (A definitive reference on the
theory of majorization is [21].) We will show using this that four examples of
discrepancy from uniformity that are larger for $X_t$ than for $U_t$ are (i) $L^p(\pi)$-distance
for any $1 \le p \le \infty$ (including the standard TV and $L^2$ distances); (ii) separation;
(iii) Hellinger distance; and (iv) Kullback-Leibler divergence.

The technique we use to prove that $\pi_t$ majorizes $\sigma_t$ is new and remarkably
simple, yet quite general. In Section 2 we describe our method of comparison
inequalities. We show (Corollary 2.5) that if two Markov semigroups satisfy a
certain comparison inequality at time 1, then they satisfy the same comparison
inequality at all times $t$. We also show, in Section 3 (see especially Corollary 3.3),
how the comparison inequality can be used to compare mixing times (in a variety
of senses) for the chains with the given semigroups.

In Section 4 we show that, in the context of the above path-problem (of finding
the FMMC on a path), if one restricts either (i) to monotone chains, or (ii) to
even times, then the uniform chain satisfies a favorable comparison inequality in
comparison with any other chain in the class considered. Somewhat delicate
arguments (needed except in the case of $L^2$-distance) specific to the path-problem
allow us to remove the parity restriction from the conclusion that the uniform
chain is fastest. (See Theorem 4.3.) Further, comparisons between chains (even
time-inhomogeneous ones) other than the fastest $U$ can be carried out with our
method by limiting attention either to monotone kernels or to two-step kernels.
Indeed, our Proposition 2.4 rather generally provides a new tool for the notoriously
difficult analysis of time-inhomogeneous chains, whose nascent quantitative
theory has been advanced impressively in recent work of Saloff-Coste and Zúñiga
[28, ?, ?, 30].

In Section 5 (see Theorem 5.1), we generalize our path-problem result as follows.
Let $\pi$ be a log-concave pmf on $\mathcal{X} = \{0, \dots, n\}$. Among all monotone birth-and-death
kernels $K$, the fastest to mix (again, in a variety of senses) is $K_\pi$ with (death,
hold, birth) probabilities given by

\[ q_i = \frac{\pi_{i-1}}{\pi_{i-1} + \pi_i}, \qquad
   r_i = \frac{\pi_i^2 - \pi_{i-1}\pi_{i+1}}{(\pi_{i-1} + \pi_i)(\pi_i + \pi_{i+1})}, \qquad
   p_i = \frac{\pi_{i+1}}{\pi_i + \pi_{i+1}}. \]

(This reduces to the uniform chain when $\pi$ is uniform.)
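
As a small numerical illustration of the kernel $K_\pi$, the following Python sketch builds it in exact rational arithmetic and checks that $\pi$ is stationary. It assumes the (death, hold, birth) probabilities are $q_i = \pi_{i-1}/(\pi_{i-1}+\pi_i)$, $r_i = 1 - q_i - p_i$, and $p_i = \pi_{i+1}/(\pi_i+\pi_{i+1})$, together with the boundary convention $\pi_{-1} = \pi_{n+1} = 0$; the function names and the sample pmf are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def k_pi(pi):
    """Birth-and-death kernel K_pi on {0, ..., n} (assumed formula, see lead-in)."""
    n = len(pi) - 1
    # Pad with the assumed boundary convention pi_{-1} = pi_{n+1} = 0.
    ext = [Fraction(0)] + [Fraction(x) for x in pi] + [Fraction(0)]
    K = [[Fraction(0)] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        left, mid, right = ext[i], ext[i + 1], ext[i + 2]
        q = left / (left + mid)     # death probability q_i
        p = right / (mid + right)   # birth probability p_i
        if i > 0:
            K[i][i - 1] = q
        if i < n:
            K[i][i + 1] = p
        K[i][i] = 1 - q - p         # holding probability r_i
    return K

def is_stationary(pi, K):
    """Check pi K = pi exactly."""
    m = len(pi)
    return all(sum(pi[i] * K[i][j] for i in range(m)) == pi[j] for j in range(m))

# A log-concave pmf on {0, 1, 2, 3} (hypothetical example data).
pi = [Fraction(k, 10) for k in (4, 3, 2, 1)]
K = k_pi(pi)

# With uniform pi the construction should give the uniform chain.
U = k_pi([Fraction(1, 4)] * 4)
```

Log-concavity of $\pi$ is what keeps every holding probability nonnegative in this sketch, since the numerator of $r_i$ is $\pi_i^2 - \pi_{i-1}\pi_{i+1}$.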

In Section 6 we revisit the birth-and-death problems of Sections 4–5 in terms of an
alternative notion of mixing time employed by Lovász and Winkler [50]. Consider,
for example, the path-problem of Section 4. For every even value of $n$ the uniform
chain is fastest-mixing in their sense, too. But, perhaps somewhat surprisingly,
for every odd value of $n$ the uniform chain is not fastest-mixing in their sense; we
identify the chain that is.

In Section 7 we discuss a simple "ladder" game, where the class of kernels is a
certain subclass of the symmetric birth-and-death kernels considered in Section 4.

In Section 8 we show how comparison inequalities can recover and extend (among
other ways, to certain card-shuffling chains) a Peres–Winkler result about slowing
down mixing by skipping ("censoring") updates of monotone spin systems. (This is
an example of comparison inequalities applied to time-inhomogeneous chains.)

2. Comparison inequalities 

In this section we introduce our new concept of comparison inequalities. Consider
a pmf $\pi > 0$ on a given finite partially ordered state space $\mathcal{X}$. We utilize the usual
$L^2(\pi)$ inner product

\[ \langle f, g \rangle \equiv \langle f, g \rangle_\pi := \sum_{i \in \mathcal{X}} \pi(i) f(i) g(i); \tag{2.1} \]

if a matrix $K$ is regarded in the usual fashion as an operator on $L^2(\pi)$ by regarding
functions on $\mathcal{X}$ as column vectors, then the $L^2(\pi)$-adjoint of $K$ (also known
as the time-reversal of $K$, when $K$ is a Markov kernel) is $K^*$ with $K^*(i, j) =
\pi(j) K(j, i) / \pi(i)$. Reversibility with respect to $\pi$ for a Markov kernel $K$ is simply
the condition that $K$ is self-adjoint.

Let $\mathcal{K}$, $\mathcal{M}$, and $\mathcal{F}$ denote the respective classes of (i) Markov kernels on $\mathcal{X}$
with stationary distribution $\pi$, (ii) nonnegative non-increasing functions on $\mathcal{X}$, and
(iii) kernels $K$ from $\mathcal{K}$ that are stochastically monotone (meaning that $Kf \in \mathcal{M}$ for
every $f \in \mathcal{M}$). Note for future reference that the identity kernel $I$ always belongs
to $\mathcal{F}$, regardless of $\pi$. Define a comparison inequality relation $\preceq$ on $\mathcal{K}$ by declaring
that $K \preceq L$ if $\langle Kf, g \rangle \le \langle Lf, g \rangle$ for every $f, g \in \mathcal{M}$, and observe that $K \preceq L$ if and
only if the time-reversals $K^*$ and $L^*$ satisfy $K^* \preceq L^*$.
Remark 2.1. (a) Clearly,

(i) to verify a comparison inequality $K \preceq L$ by establishing $\langle Kf, g \rangle \le \langle Lf, g \rangle$,
it is sufficient to take $f$ and $g$ to be indicator functions of down-sets (i.e.,
sets $D$ such that $y \in D$ and $x \le y$ implies $x \in D$) in the partial order; and

(ii) if a comparison inequality holds, then the condition that $f$ and $g$ be
nonnegative can be dropped, if desired.

(b) There is an important existing notion of stochastic ordering for Markov
kernels on $\mathcal{X}$: We say that $L \le_{st} K$ if $Kf \le Lf$ entrywise for all $f \in \mathcal{M}$. It is
clear that $L \le_{st} K$ implies $K \preceq L$ when $K$ and $L$ belong to $\mathcal{F}$. But in all the
examples in this paper where we prove a comparison inequality, we do not have
stochastic ordering. This will typically be the case for interesting examples, since
the requirement that distinct kernels $K$ and $L$ have the same stationary distribution
makes it difficult (though not impossible) to have $L \le_{st} K$.
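
The reduction in Remark 2.1(a)(i) makes a comparison inequality easy to test numerically on a totally ordered state space. The sketch below is only an illustration under stated assumptions: the state space is the path $\{0, \dots, n\}$ (so the nonempty down-sets are $\{0, \dots, \ell\}$), the kernels are symmetric birth-and-death kernels with uniform $\pi$, and all function names are hypothetical.

```python
from fractions import Fraction

def sym_bd_kernel(p):
    """Symmetric birth-and-death kernel with K(i, i+1) = K(i+1, i) = p[i]."""
    n = len(p)
    K = [[Fraction(0)] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        K[i][i + 1] = K[i + 1][i] = Fraction(p[i])
    for i in range(n + 1):
        K[i][i] = 1 - sum(K[i])  # holding probability
    return K

def apply(K, f):
    """Column-vector action (K f)(i) = sum_j K(i, j) f(j)."""
    return [sum(K[i][j] * f[j] for j in range(len(f))) for i in range(len(K))]

def inner(f, g, pi):
    """L^2(pi) inner product <f, g>."""
    return sum(w * a * b for w, a, b in zip(pi, f, g))

def preceq(K, L, pi):
    """Test <Kf, g> <= <Lf, g> over indicators f, g of down-sets {0, ..., l}."""
    m = len(pi)
    downs = [[1 if j <= l else 0 for j in range(m)] for l in range(m)]
    return all(inner(apply(K, f), g, pi) <= inner(apply(L, f), g, pi)
               for f in downs for g in downs)

pi = [Fraction(1, 4)] * 4                                   # uniform on {0, 1, 2, 3}
K0 = sym_bd_kernel([Fraction(1, 2)] * 3)                    # uniform chain
K = sym_bd_kernel([Fraction(1, 4), Fraction(1, 3), Fraction(1, 2)])
```

Here `preceq(K0, K, pi)` holds while `preceq(K, K0, pi)` fails, so the two kernels are strictly ordered in this example.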

Remark 2.2. The relation $\preceq$ defines a partial order on $\mathcal{K}$. Indeed, reflexivity and
transitivity are immediate, and antisymmetry follows because one can build a basis
for functions on $\mathcal{X}$ from elements $f$ of $\mathcal{M}$, namely, the indicators of principal
down-sets (i.e., down-sets of the form $\langle x \rangle := \{y : y \le x\}$ with $x \in \mathcal{X}$). A proof from first
principles is easy.

We list next a few basic properties of the comparison relation $\preceq$ on $\mathcal{K}$, showing
that the relation is preserved under passages to limits, mixtures, and direct sums.
The proofs are all very easy. Note also that the class $\mathcal{F}$ of stochastically monotone
kernels with stationary distribution $\pi$ is closed under passages to limits and
mixtures, and also under (finite) products, but not under general direct sums as in
part (c).

Proposition 2.3.

(a) If $K_t \preceq L_t$ for every $t$ and $K_t \to K$ and $L_t \to L$, then $K \preceq L$.

(b) If $K_t \preceq L_t$ for $t = 0, 1$ and $0 \le \lambda \le 1$, then

\[ (1 - \lambda) K_0 + \lambda K_1 \preceq (1 - \lambda) L_0 + \lambda L_1. \]

(c) Partition $\mathcal{X}$ arbitrarily into subsets $\mathcal{X}_0$ and $\mathcal{X}_1$, and let each $\mathcal{X}_i$ inherit its
partial order and stationary distribution from $\mathcal{X}$. For $i = 0, 1$, suppose $K_i \preceq L_i$ on
$\mathcal{X}_i$. Define the kernel $K$ (respectively, $L$) as the direct sum of $K_0$ and $K_1$ (resp.,
$L_0$ and $L_1$). Then $K \preceq L$.

The following proposition, showing that $\preceq$ is preserved under product for
stochastically monotone reversible kernels, is the main result of this section.

Proposition 2.4 (Comparison Inequalities). Let $K_1, \dots, K_t$ and $L_1, \dots, L_t$
be reversible [i.e., $L^2(\pi)$-self-adjoint] kernels all belonging to $\mathcal{F}$, and suppose that
$K_s \preceq L_s$ for $s = 1, \dots, t$. Then the product kernels $K_1 \cdots K_t$ and $L_1 \cdots L_t$ (and
their time-reversals) belong to $\mathcal{F}$, and $K_1 \cdots K_t \preceq L_1 \cdots L_t$.

Footnote (to Remark 2.2): We need only show that the indicator function $1_{\{x\}}$ of any singleton $\{x\}$ can be written as a
linear combination of indicator functions of principal down-sets. But this can be done recursively
by starting with minimal elements x and then using the identity 
The application to time-homogeneous chains is the following immediate corollary.

Corollary 2.5. If $K, L \in \mathcal{F}$ are reversible and $K \preceq L$, then for every $t$ we have
$K^t, L^t \in \mathcal{F}$ and $K^t \preceq L^t$.
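
To see Corollary 2.5 at work, one can raise two ordered kernels to successive powers and re-test the comparison inequality each time. The following sketch does this for two monotone symmetric birth-and-death kernels on a four-point path with uniform $\pi$; the example kernels and all helper names are hypothetical, and the test over down-set indicators again assumes a totally ordered state space.

```python
from fractions import Fraction

def sym_bd_kernel(p):
    """Symmetric birth-and-death kernel with K(i, i+1) = K(i+1, i) = p[i]."""
    n = len(p)
    K = [[Fraction(0)] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        K[i][i + 1] = K[i + 1][i] = Fraction(p[i])
    for i in range(n + 1):
        K[i][i] = 1 - sum(K[i])
    return K

def matmul(A, B):
    """Plain matrix product of two square matrices."""
    m = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(m)]
            for i in range(m)]

def preceq_uniform(K, L):
    """K preceq L for uniform pi on a path: compare sums of K(i, j) and L(i, j)
    over i <= m, j <= l, which is (n + 1) <K 1_{<=l}, 1_{<=m}>."""
    size = len(K)
    def block(M, l, m):
        return sum(M[i][j] for i in range(m + 1) for j in range(l + 1))
    return all(block(K, l, m) <= block(L, l, m)
               for l in range(size) for m in range(size))

def powers_ordered(K, L, T):
    """Check K^t preceq L^t for t = 1, ..., T."""
    Kt, Lt = K, L
    for _ in range(T):
        if not preceq_uniform(Kt, Lt):
            return False
        Kt, Lt = matmul(Kt, K), matmul(Lt, L)
    return True

K0 = sym_bd_kernel([Fraction(1, 2)] * 3)                       # uniform chain
K = sym_bd_kernel([Fraction(1, 4), Fraction(1, 3), Fraction(1, 2)])
```

Both kernels here have all $p_i \le 1/2$, so they are stochastically monotone; the corollary does not promise anything when that hypothesis is dropped.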

Remark 2.6. As we shall see from examples, the applicability of our new technique
of comparison inequalities is limited (i) by the monotonicity requirement for
membership in $\mathcal{F}$ and (ii) by the extent to which $\mathcal{F}$ is ordered by $\preceq$. But restriction (i) in
the choice of kernel has the payoff (among others) that the perfect simulation
algorithms (see [?] for background) Coupling From The Past [?] and FMMR
(Fill–Machida–Murdoch–Rosenthal) [15, ?] can often be run efficiently for monotone
chains. Restriction (ii) needs to be explored thoroughly for interesting and
important examples. This paper treats a few examples in Section 4 (especially Section 4.1)
and in later sections. For discussion about the relation between our comparison-inequalities
technique and existing techniques for comparing mixing times of Markov chains,
see Remark 3.5 below.

The remainder of this section is devoted to the proof of Proposition 2.4, which we
will derive as a consequence of an extremely simple, but (as far as we know) new,
matrix-theoretic result, Proposition 2.7.

The general setting is this. We are given a positive vector $\pi \in R^n$ and define the
$L^2(\pi)$ inner product as at (2.1). We are also given a set (not necessarily a subspace)
$W \subset R^n$. Let $M_n(R)$ denote the collection of $n$-by-$n$ real matrices. Define

\[ \mathcal{F} := \{\text{matrices } A \in M_n(R) \text{ for which } W \text{ is invariant}\}. \]

(This of course means that a real matrix $A$ belongs to $\mathcal{F}$ if and only if $Aw \in W$ for
every $w \in W$.) Define a (clearly reflexive and transitive) relation $\preceq$ on $M_n(R)$ by
declaring that $A \preceq B$ if

\[ \langle Ax, y \rangle \le \langle Bx, y \rangle \quad \text{for every } x, y \in W. \]

We observe in passing (i) that $A \preceq B$ if and only if $A^* \preceq B^*$ and (ii) that the
relation $\preceq$ may fail to be antisymmetric (but this will present no difficulty).

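Proposition 2.7. Let $A_1, A_2, B_1, B_2 \in M_n(R)$. Suppose that $A_2$ and $B_1^*$ both
belong to $\mathcal{F}$. If $A_1 \preceq B_1$ and $A_2 \preceq B_2$, then $A_1 A_2 \preceq B_1 B_2$.

Proof. Given $x, y \in W$, we observe

\begin{align*}
\langle A_1 A_2 x, y \rangle &\le \langle B_1 A_2 x, y \rangle && \text{because } A_2 x, y \in W \text{ and } A_1 \preceq B_1 \\
 &= \langle A_2 x, B_1^* y \rangle \\
 &\le \langle B_2 x, B_1^* y \rangle && \text{because } x, B_1^* y \in W \text{ and } A_2 \preceq B_2 \\
 &= \langle B_1 B_2 x, y \rangle,
\end{align*}

as desired. □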

The third (Corollary 2.10) of the following four easy corollaries of Proposition 2.7
implies Proposition 2.4 immediately, by setting $W = \mathcal{M}$ and observing that the set
of Markov kernels with stationary distribution $\pi > 0$ is closed under both
multiplication and adjoint. (Similarly, Corollary 2.5 is a special case of Corollary 2.11.)

Corollary 2.8. Let $A_1, A_2, B_1, B_2$ be matrices all belonging to $\mathcal{F}$ with adjoints all
belonging to $\mathcal{F}$, and suppose that $A_1 \preceq B_1$ and $A_2 \preceq B_2$. Then the matrices $A_1 A_2$
and $B_1 B_2$ and their adjoints all belong to $\mathcal{F}$, and $A_1 A_2 \preceq B_1 B_2$.

Proof. This is immediate from the definition of $\mathcal{F}$ and Proposition 2.7. □

Corollary 2.9. Let $A_1, \dots, A_t$ and $B_1, \dots, B_t$ be matrices all belonging to $\mathcal{F}$ with
adjoints all belonging to $\mathcal{F}$, and suppose that $A_s \preceq B_s$ for $s = 1, \dots, t$. Then the
matrices $A_1 \cdots A_t$ and $B_1 \cdots B_t$ and their adjoints all belong to $\mathcal{F}$, and
$A_1 \cdots A_t \preceq B_1 \cdots B_t$.

Proof. This follows by induction from Corollary 2.8. □

Corollary 2.10. Let $A_1, \dots, A_t$ and $B_1, \dots, B_t$ be self-adjoint matrices all
belonging to $\mathcal{F}$, and suppose that $A_s \preceq B_s$ for $s = 1, \dots, t$. Then the matrices $A_1 \cdots A_t$
and $B_1 \cdots B_t$ (and their adjoints) belong to $\mathcal{F}$, and $A_1 \cdots A_t \preceq B_1 \cdots B_t$.

Proof. This is immediate from Corollary 2.9. □

Corollary 2.11. Let $A$ and $B$ be self-adjoint matrices both belonging to $\mathcal{F}$, and
suppose that $A \preceq B$. Then, for every $t = 0, 1, 2, \dots$, the matrices $A^t$ and $B^t$ (are
self-adjoint and) belong to $\mathcal{F}$ and $A^t \preceq B^t$.

Proof. This is immediate from Corollary 2.10 by taking $A_s \equiv A$ and $B_s \equiv B$. □

3. Consequences of the comparison inequality, some via majorization 

In this section we focus on time-homogeneous chains and show how comparison
inequalities can be used to compare mixing times (in a variety of senses) for chains
with the given semigroups. As we shall see in Section 3.3, a useful tool in moving
from a comparison inequality to a comparison of mixing times will be the use of
basic results from the theory of majorization.

3.1. Comparison inequalities and domination. Recall from Section 2 that $\mathcal{F}$
denotes the class of stochastically monotone Markov kernels on a given finite
partially ordered state space $\mathcal{X}$ that have a given $\pi$ as stationary distribution. Our
next result (Proposition 3.2) gives conditions implying that if a comparison
inequality holds between reversible kernels $K, L \in \mathcal{F}$, then the univariate distributions of
the corresponding Markov chains satisfy corresponding stochastic inequalities. The
proposition utilizes the following definition.

Definition 3.1. Let $(Y_t)$ and $(Z_t)$ be stochastic processes with the same finite
partially ordered state space. If for every $t$ we have $Y_t \ge Z_t$ stochastically, i.e.,

\[ P(Y_t \in D) \le P(Z_t \in D) \quad \text{for every down-set } D \text{ in the partial order}, \tag{3.1} \]

then we say that $Y$ dominates $Z$.

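Proposition 3.2. Suppose that $K, L \in \mathcal{F}$ are reversible and satisfy $K \preceq L$. If $Y$
and $Z$ are chains (i) started in a common pmf $\hat\pi$ such that $\hat\pi/\pi$ is non-increasing
and (ii) having respective kernels $K$ and $L$, then $Y$ dominates $Z$.

Proof. By Corollary 2.5, for every $t$ we have $K^t, L^t \in \mathcal{F}$ and $K^t \preceq L^t$. The desired
result now follows easily: for any down-set $D$, taking $f = 1_D$ and $g = \hat\pi/\pi$ (both in $\mathcal{M}$)
gives $P(Y_t \in D) = \langle K^t f, g \rangle \le \langle L^t f, g \rangle = P(Z_t \in D)$. □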
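
Domination in the sense of Definition 3.1 can be checked directly by propagating the two laws and comparing their cdfs at every time. The sketch below assumes the path $\{0, \dots, n\}$ with its usual order (so down-sets are initial segments) and chains started at 0, which makes $\hat\pi/\pi$ non-increasing when $\pi$ is uniform; the kernels and names are hypothetical examples.

```python
from fractions import Fraction

def sym_bd_kernel(p):
    """Symmetric birth-and-death kernel with K(i, i+1) = K(i+1, i) = p[i]."""
    n = len(p)
    K = [[Fraction(0)] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        K[i][i + 1] = K[i + 1][i] = Fraction(p[i])
    for i in range(n + 1):
        K[i][i] = 1 - sum(K[i])
    return K

def step(mu, K):
    """One step of the law: row vector mu times K."""
    m = len(mu)
    return [sum(mu[i] * K[i][j] for i in range(m)) for j in range(m)]

def cdf(mu):
    """Probabilities of the down-sets {0, ..., l}."""
    out, total = [], Fraction(0)
    for x in mu:
        total += x
        out.append(total)
    return out

def dominates(KY, KZ, mu0, T):
    """Check P(Y_t in D) <= P(Z_t in D) for all down-sets D and t = 0, ..., T."""
    y, z = list(mu0), list(mu0)
    for _ in range(T + 1):
        if any(a > b for a, b in zip(cdf(y), cdf(z))):
            return False
        y, z = step(y, KY), step(z, KZ)
    return True

K0 = sym_bd_kernel([Fraction(1, 2)] * 3)                       # kernel of Y
K = sym_bd_kernel([Fraction(1, 4), Fraction(1, 3), Fraction(1, 2)])  # kernel of Z
start_at_0 = [Fraction(1), Fraction(0), Fraction(0), Fraction(0)]
```

Since $K_0 \preceq K$ in this example, the chain with kernel $K_0$ should dominate the chain with kernel $K$, and not the other way around.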
3.2. TV, separation, and $L^2$-distance. Domination (recall Definition 3.1) is
quite useful for comparing mixing times in at least three standard senses.

If $d$ is some measure of discrepancy from stationarity, then in the following
corollary we write "$Y$ mixes faster in $d$ than does $Z$" for the strong assertion that
at every time $t$ we have $d$ smaller for $Y$ than for $Z$.

Corollary 3.3. Consider (not necessarily reversible) Markov chains $Y$ and $Z$ with
common finite partially ordered state space $\mathcal{X}$, common initial distribution $\hat\pi$, and
common stationary distribution $\pi$. Assume that $\hat\pi/\pi$ is non-increasing.

(a) [total variation distance] Suppose that $Y$ dominates $Z$ and that the
time-reversal of $Y$ is stochastically monotone. Then $Y$ mixes faster in TV than does $Z$.

(b) [separation] Adopt the same hypotheses as in part (a). Then $Y$ mixes faster
in separation than does $Z$; equivalently, any fastest strong stationary time for $Y$ is
stochastically smaller (i.e., faster) than any strong stationary time for $Z$.

(c) [$L^2$-distance] Assume that $Y$ and $Z$ are reversible. Suppose, moreover, that
the two-step chain $(Y_{2t})$ dominates $(Z_{2t})$ and is stochastically monotone. Then $Y$
mixes faster in $L^2$ than does $Z$.



Proof. All three results are simple applications of the domination inequality (3.1)
[which, in the case of part (c), is guaranteed only for even values of $t$] or its
immediate extension to expectations of non-increasing functions. We make the preliminary
observation that $P(Y_t = i)/\pi(i)$ is non-increasing in $i$ for each $t$; indeed, writing $K$
for the kernel of $Y$ we have

\[ \frac{P(Y_t = i)}{\pi(i)} = \sum_j \frac{\hat\pi(j) K^t(j, i)}{\pi(i)}
   = \sum_j (K^*)^t(i, j) \, \frac{\hat\pi(j)}{\pi(j)}, \tag{3.2} \]

so the non-increasingness claimed here follows from the monotonicity assumptions
about $\hat\pi/\pi$ and $K^*$.

(a) Choosing $D$ in (3.1) to be the down-set $D = \{i : P(Y_t = i)/\pi(i) > 1\}$ we find

\[ TV_Y(t) = P(Y_t \in D) - \pi(D) \le P(Z_t \in D) - \pi(D) \le TV_Z(t). \]

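(b) We first observe

\[ sep_Y(t) = \max_i \left[ 1 - \frac{P(Y_t = i)}{\pi(i)} \right] = 1 - \frac{P(Y_t = x_1)}{\pi(x_1)} \]

for some maximal element $x_1$ in $\mathcal{X}$. Therefore, choosing $D = \mathcal{X} \setminus \{x_1\}$ we find

\[ sep_Y(t) = 1 - \frac{P(Y_t = x_1)}{\pi(x_1)} \le 1 - \frac{P(Z_t = x_1)}{\pi(x_1)}
   \le \max_i \left[ 1 - \frac{P(Z_t = i)}{\pi(i)} \right] = sep_Z(t). \]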

(c) Using routine calculations suppressed here, one finds that the squared $L^2(\pi)$-distance
(of the density with respect to $\pi$) from stationarity for $Y_t$ equals

\[ \sum_i \pi(i) \left[ \frac{P(Y_t = i)}{\pi(i)} - 1 \right]^2
   = E\left[ \frac{\hat\pi(Y_{2t})}{\pi(Y_{2t})} \right] - 1. \]

But $\hat\pi/\pi$ is non-increasing and $Y_{2t} \ge Z_{2t}$ stochastically; so this last expression does
not exceed

\[ E\left[ \frac{\hat\pi(Z_{2t})}{\pi(Z_{2t})} \right] - 1
   = \sum_i \pi(i) \left[ \frac{P(Z_t = i)}{\pi(i)} - 1 \right]^2, \]

which is the desired conclusion. □



We remark in passing that a very similar proof as for Corollary 3.3(b) gives the
analogous result for the measure of discrepancy

\[ \max_i \left[ \frac{P(Y_t = i)}{\pi(i)} - 1 \right], \]

and so we also have the analogous result for the two-sided measure

\[ \max_i \left| \frac{P(Y_t = i)}{\pi(i)} - 1 \right|. \tag{3.3} \]

Remark 3.4. [$L^2$-distance revisited] We have limited the statement of
Corollary 3.3(c) to reversible chains for simplicity. The same proof shows, more generally,
for each $t$ that if (i) $K$ and $L$ are (not necessarily reversible) kernels with common
stationary distribution $\pi$, (ii) $\hat\pi/\pi$ is non-increasing, and (iii) $\hat\pi K^t (K^*)^t \ge \hat\pi L^t (L^*)^t$
stochastically, then the $L^2(\pi)$-distance from stationarity for $Y_t$ does not exceed that
for $Z_t$, where the chains $Y$ and $Z$ have respective kernels $K$ and $L$ and common
initial distribution $\hat\pi$. Assuming (i)-(ii), for the stochastic inequality (iii) here it is
sufficient that $K$ and $L$ and their time-reversals $K^*$ and $L^*$ are all stochastically
monotone and $K \preceq L$.

Remark 3.5. [concerning eigenvalues] (a) If $K$ and $L$ are ergodic reversible
kernels in $\mathcal{F}$ (with a common stationary distribution $\pi$) and we have the comparison
inequality $K \preceq L$, then the SLEM for $K$ is no larger than the SLEM for $L$. This
follows rather easily from Proposition 3.2 and Corollary 3.3(c) using the spectral
representations of the kernels and the ample freedom in choice of the common initial
distribution $\hat\pi$ such that $\hat\pi/\pi$ is non-increasing. We omit further details.

(b) There are several existing standard techniques for comparing mixing times of
Markov chains, such as the celebrated eigenvalues-comparison technique of Diaconis
and Saloff-Coste [?], but none give conclusions as strong as those available from
combining Proposition 3.2 and Corollary 3.3. On the other hand, comparison of
eigenvalues requires verifying far fewer assumptions than needed to establish $K, L \in
\mathcal{F}$ and a comparison inequality $K \preceq L$, so our new technique is much less generally
applicable.

3.3. Other distances via majorization. We now utilize ideas from majorization;
see [21] for background on majorization and the concept of Schur-convexity used
below. For the reader's convenience we recall that, given two vectors $v$ and $w$ in
$R^N$ (for some $N$), we say that $v$ majorizes $w$ if (i) for each $k = 1, \dots, N$ the sum of
the $k$ largest entries of $w$ is at most the corresponding sum for $v$ and (ii) equality
holds when $k = N$. A function $\phi$ with domain $D \subset R^N$ is said to be Schur-convex
on $D$ if $\phi(v) \ge \phi(w)$ whenever $v, w \in D$ and $v$ majorizes $w$. Thus, given any two
pmfs $p_1$ and $p_2$ on $\mathcal{X}$, if $p_1$ majorizes $p_2$, then for any Schur-convex function $\phi$
on the unit simplex (i.e., the space of pmfs) we have $\phi(p_1) \ge \phi(p_2)$. Examples of
Schur-convex functions are given in Example 3.8 below: for each of those examples,
the inequality $\phi(p_1) \ge \phi(p_2)$ can be interpreted as "$p_2$ is closer to $\pi$ than is $p_1$".
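
The definition of majorization translates directly into a comparison of sorted partial sums. The following sketch uses the standard convention, in which the partial sums of the majorizing vector $v$ are at least those of $w$ and the totals agree; the function name `majorizes` is a hypothetical helper, and exact rationals are used so that the equality of totals is tested without rounding.

```python
from fractions import Fraction

def majorizes(v, w):
    """Return True if v majorizes w: for each k, the sum of the k largest
    entries of v is at least that of w, with equality for the full sums."""
    a = sorted(v, reverse=True)
    b = sorted(w, reverse=True)
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    sa = sb = Fraction(0)
    for x, y in zip(a, b):
        sa += x
        sb += y
        if sa < sb:
            return False
    return sa == sb

point_mass = [Fraction(1), Fraction(0), Fraction(0)]
spread = [Fraction(1, 2), Fraction(1, 2), Fraction(0)]
uniform = [Fraction(1, 3)] * 3
```

In this example the point mass majorizes `spread`, which in turn majorizes the uniform pmf; every pmf majorizes the uniform one, which is why Schur-convex functions measure distance from uniformity.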
The next proposition describes one important case where we have majorization
and hence can extend the conclusions "$Y$ mixes faster in $d$ than does $Z$" of
Corollary 3.3 to other measures of discrepancy $d$. Note the additional hypothesis, relative
to Corollary 3.3, that $\pi$ is non-increasing.

Proposition 3.6. Consider (not necessarily reversible) Markov chains $Y$ and $Z$
with common finite partially ordered state space $\mathcal{X}$, common initial distribution $\hat\pi$,
and common stationary distribution $\pi$. Suppose that both $\pi$ and $\hat\pi/\pi$ are
non-increasing. Suppose, moreover, that $Y$ dominates $Z$ and that the time-reversal of $Y$
is stochastically monotone. Then, for all $t$, the pmf $\pi_t$ of $Z_t$ majorizes the pmf $\sigma_t$
of $Y_t$.



Proof. As noted just above (3.2), the ratio $P(Y_t = i)/\pi(i)$ is non-increasing in $i$;
since $\pi(i)$ is also non-increasing, so is the product $P(Y_t = i)$. Hence for each $k \le |\mathcal{X}|$
there is a down-set $D_k$ such that $P(Y_t \in D_k)$ equals the sum of the $k$ largest values
of $P(Y_t = i)$. Since $Y$ dominates $Z$, inequality (3.1) implies that, for all $t$, the pmf
$\pi_t$ of $Z_t$ majorizes the pmf $\sigma_t$ of $Y_t$. (This can be equivalently restated in language
introduced in [13]: $Z_t$ is coarser than $Y_t$, for all $t$.) □

Corollary 3.7. Suppose that $K, L \in \mathcal{F}$ are reversible and satisfy $K \preceq L$, and that
their common stationary distribution $\pi$ is non-increasing. If $Y$ and $Z$ are chains
(i) started in a common pmf $\hat\pi$ such that $\hat\pi/\pi$ is non-increasing and (ii) having
respective kernels $K$ and $L$, then, for all $t$, the pmf $\pi_t$ of $Z_t$ majorizes the pmf $\sigma_t$
of $Y_t$.

Proof. The desired conclusion follows immediately upon combining Propositions 3.2
and 3.6. □



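Example 3.8. In this example we show that when $\pi$ is uniform in Proposition 3.6 (or
Corollary 3.7), $Y$ mixes faster than does $Z$ in more senses than TV, separation,
and $L^2$.

Write $N$ for the size of the state space $\mathcal{X}$. Then each of the following six functions
is Schur-convex on the unit simplex in $R^N$:

\begin{align*}
\phi_1(v) &:= \Bigl[ N^{-1} \sum_i |N v_i - 1|^p \Bigr]^{1/p} \quad \text{(for any } 1 \le p < \infty\text{)}, \\
\phi_2(v) &:= \max_i |N v_i - 1|, \\
\phi_3(v) &:= \max_i (1 - N v_i), \\
\phi_4(v) &:= \tfrac{1}{2} \sum_i \bigl( v_i^{1/2} - N^{-1/2} \bigr)^2, \\
\phi_5(v) &:= -N^{-1} \sum_i \ln(N v_i), \\
\phi_6(v) &:= \sum_i v_i \ln(N v_i);
\end{align*}

in [21, Chapter 3], see Sections I.1, I.1, A.2, I.1.b, D.5, and D.1, respectively.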
Therefore, if $p_1$ majorizes $p_2$, then $p_2$ is closer to $\pi$ than is $p_1$ in each of the
following six senses (where here $\pi$ is uniform and we have written the discrepancy
from $\pi$ for a generic pmf $p$):

(i) $L^p$-distance
\[ \Bigl[ \sum_i \pi(i) \Bigl| \frac{p(i)}{\pi(i)} - 1 \Bigr|^p \Bigr]^{1/p} \]
for any $1 \le p < \infty$;

(ii) $L^\infty$-distance
\[ \max_i \Bigl| \frac{p(i)}{\pi(i)} - 1 \Bigr|, \]
also called relative pointwise distance;

(iii) separation
\[ \max_i \Bigl[ 1 - \frac{p(i)}{\pi(i)} \Bigr]; \]

(iv) Hellinger distance
\[ \tfrac{1}{2} \sum_i \bigl( \sqrt{p(i)} - \sqrt{\pi(i)} \bigr)^2; \]

(v) the Kullback-Leibler divergence
\[ D_{KL}(\pi \| p) = -\sum_i \pi(i) \ln \frac{p(i)}{\pi(i)}; \]

(vi) the Kullback-Leibler divergence
\[ D_{KL}(p \| \pi) = \sum_i p(i) \ln \frac{p(i)}{\pi(i)}. \]

Of course, the $L^2$-distance considered in Corollary 3.3(c) is the special case $p = 2$
of example (i) here, and the TV distance of Corollary 3.3(a) amounts to the special
case $p = 1$. Relative pointwise distance was also treated earlier without use of
majorization at (3.3).
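
The six discrepancies of this example are simple to evaluate when $\pi$ is uniform, which gives a quick numerical check that a majorizing pmf is farther from uniform in each sense. The sketch below is illustrative only: it assumes the Hellinger distance is normalized as $\tfrac12 \sum_i (\sqrt{p(i)} - \sqrt{\pi(i)})^2$, it requires $p(i) > 0$ for every $i$ so that $D_{KL}(\pi \| p)$ is finite, and the function name and sample pmfs are hypothetical.

```python
import math

def discrepancies(p, exponent=2.0):
    """Six discrepancies of a strictly positive pmf p from the uniform pmf."""
    N = len(p)
    ratio = [N * x for x in p]  # density p(i)/pi(i) with respect to uniform pi
    return {
        "Lp": (sum(abs(r - 1) ** exponent for r in ratio) / N) ** (1 / exponent),
        "Linf": max(abs(r - 1) for r in ratio),
        "sep": max(1 - r for r in ratio),
        "hellinger": 0.5 * sum((math.sqrt(x) - math.sqrt(1 / N)) ** 2 for x in p),
        "KL(pi||p)": -sum(math.log(r) for r in ratio) / N,
        "KL(p||pi)": sum(x * math.log(N * x) for x in p),
    }

# p1 majorizes p2: sorted partial sums 0.4, 0.7, 0.9, 1 versus 0.3, 0.6, 0.8, 1.
p1 = [0.4, 0.3, 0.2, 0.1]
p2 = [0.3, 0.3, 0.2, 0.2]
d1, d2 = discrepancies(p1), discrepancies(p2)
farther = {name: d1[name] >= d2[name] for name in d1}
```

Every entry of `farther` should be true for this pair, in line with the Schur-convexity of the six functions listed above.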

4. Fastest mixing on a path 

We now specialize to the path-problem. Let $K$ be any symmetric birth-and-death
transition kernel on the path $\{0, 1, \dots, n\}$, and denote $K(i, i+1) = K(i+1, i)$ by
$p_i$ [so that $K(i, i) = 1 - p_{i-1} - p_i$, except that $K(0, 0) = 1 - p_0$ and
$K(n, n) = 1 - p_{n-1}$]; for example, when $n = 3$ we have

\[ K = \begin{pmatrix}
1 - p_0 & p_0 & 0 & 0 \\
p_0 & 1 - p_0 - p_1 & p_1 & 0 \\
0 & p_1 & 1 - p_1 - p_2 & p_2 \\
0 & 0 & p_2 & 1 - p_2
\end{pmatrix}. \]

In this section we first show, in Sections 4.1–4.2, that if one restricts attention either

(i) to monotone chains, or

(ii) to even times,

then the uniform chain $U$ with kernel $K_0$ where $p_i \equiv 1/2$ satisfies a favorable
comparison inequality in comparison with the general $K$-chain, and we can apply
all the results of Section 3. Then, in Section 4.3, we show that the parity restriction
in (ii) can be removed to conclude that the uniform chain is, among all symmetric
birth-and-death chains, closest to uniformity (in several senses) at all times. In this
section and the next we make use of the general observation that a discrete-time
birth-and-death chain with kernel $K$ on $\mathcal{X} = \{0, 1, \dots, n\}$ is monotone if and only
if

\[ K(i, i+1) + K(i+1, i) \le 1 \quad \text{for } i = 0, \dots, n-1. \tag{4.1} \]
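
The criterion (4.1) can be compared against the definition of stochastic monotonicity, using the fact that it suffices to check that $K 1_D$ is non-increasing for each down-set $D = \{0, \dots, \ell\}$. The sketch below does this for general (not necessarily symmetric) birth-and-death kernels; the helper names and the two sample kernels are hypothetical.

```python
from fractions import Fraction

def bd_kernel(birth, death):
    """Birth-and-death kernel with K(i, i+1) = birth[i], K(i+1, i) = death[i]."""
    n = len(birth)
    K = [[Fraction(0)] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        K[i][i + 1] = Fraction(birth[i])
        K[i + 1][i] = Fraction(death[i])
    for i in range(n + 1):
        K[i][i] = 1 - sum(K[i])  # holding probability
    return K

def monotone_by_definition(K):
    """K 1_D must be non-increasing for every down-set D = {0, ..., l}."""
    m = len(K)
    for l in range(m):
        Kf = [sum(K[i][: l + 1]) for i in range(m)]  # (K 1_D)(i) = K(i, D)
        if any(Kf[i] < Kf[i + 1] for i in range(m - 1)):
            return False
    return True

def monotone_by_41(K):
    """Criterion (4.1): K(i, i+1) + K(i+1, i) <= 1 along every edge."""
    return all(K[i][i + 1] + K[i + 1][i] <= 1 for i in range(len(K) - 1))

half, quarter = Fraction(1, 2), Fraction(1, 4)
K_good = bd_kernel([half, half], [half, half])
K_bad = bd_kernel([3 * quarter, quarter], [half, half])  # 3/4 + 1/2 > 1 on edge {0, 1}
```

For `K_bad` the failure is visible already with $D = \{0\}$: the chain started at 1 is more likely to be at 0 after one step than the chain started at 0.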

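Before we separate into the two cases (i) and (ii) for the path-problem, let us
note that if $f$ is the indicator of the down-set $\{0, 1, \dots, \ell\}$, then $Kf$ satisfies

\[ (Kf)_j = \begin{cases}
1 & \text{if } 0 \le j \le \ell - 1 \\
1 - p_\ell & \text{if } j = \ell \\
p_\ell & \text{if } j = \ell + 1 \\
0 & \text{otherwise}
\end{cases} \tag{4.2} \]

(with $p_n = 0$); hence if $g$ is the indicator of the down-set $\{0, 1, \dots, m\}$, then

\[ (n + 1) \langle Kf, g \rangle = \begin{cases}
m + 1 & \text{if } 0 \le m \le \ell - 1 \\
\ell + 1 - p_\ell & \text{if } m = \ell \\
\ell + 1 & \text{if } \ell + 1 \le m \le n.
\end{cases} \tag{4.3} \]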

4.1. Restriction to monotone chains. Applying (4.1), our symmetric kernel $K$
is monotone if and only if $p_i \le 1/2$ for $i = 0, \dots, n-1$. Among all such choices, it is
clear that (4.3) is minimized when $K = K_0$. From Remark 2.1(a)(i) it therefore follows
that $K_0 \preceq K$ and hence from Section 3 (especially Corollary 3.7 and Example 3.8)
that $K_0$ is fastest-mixing in several senses.

Remark 4.1. In fact, from (4.3) we see that monotone symmetric birth-and-death
kernels $K$ are monotonically decreasing in the partial order $\preceq$ with respect to
each $p_i$.

4.2. Restriction to even times. In the present setting of symmetric birth-and-death
kernel, note that our restriction (simply to ensure that $K$ is a kernel) on the
values $p_i \ge 0$ is that $p_i + p_{i+1} \le 1$ for $i = 0, \dots, n-1$. It is then routine to check
that $K^2$ is (like $K$) reversible and (perhaps unlike $K$) monotone. Indeed, if $f$ is
the indicator of the down-set $\{0, 1, \dots, \ell\}$, then $K^2 f$ satisfies

\[ (K^2 f)_j = \begin{cases}
1 & \text{if } 0 \le j \le \ell - 2 \\
1 - p_{\ell-1} p_\ell & \text{if } j = \ell - 1 \\
1 - 2 p_\ell + 2 p_\ell^2 + p_{\ell-1} p_\ell & \text{if } j = \ell \\
2 p_\ell - 2 p_\ell^2 - p_\ell p_{\ell+1} & \text{if } j = \ell + 1 \\
p_\ell p_{\ell+1} & \text{if } j = \ell + 2 \\
0 & \text{otherwise,}
\end{cases} \tag{4.4} \]

which is easily checked to be non-increasing in $j$.

Suppose now that $g$ is the indicator of the down-set $\{0, 1, \dots, m\}$. Then
using (4.4) we can calculate, and subsequently minimize over the allowable choices of
$p_0, \dots, p_{n-1}$, the quantity $\langle K^2 f, g \rangle$ by considering three cases:

(a) Suppose $m = \ell$. Then

\[ (n + 1) \langle K^2 f, g \rangle = \ell + (1 - p_\ell)^2 + p_\ell^2 \]

is minimized (regardless of the value of $\ell$) when $p_i \equiv 1/2$ for $i = 0, \dots, n-1$.






JAMES ALLEN FILL AND JONAS KAHN 



(b) Suppose $\ell$ and $m$ differ by exactly 1, say, $m = \ell + 1$. Then

\[ (n + 1) \langle K^2 f, g \rangle = \ell + (1 - p_\ell) + p_\ell (1 - p_{\ell+1}) = \ell + 1 - p_\ell p_{\ell+1} \]

is minimized (regardless of $\ell$) when $p_i \equiv 1/2$ for $i = 0, \dots, n-1$.

(c) Suppose $\ell$ and $m$ differ by at least 2, say, $m \ge \ell + 2$. Then

\[ (n + 1) \langle K^2 f, g \rangle = \ell + (1 - p_\ell p_{\ell+1}) + p_\ell p_{\ell+1} = \ell + 1 \]

doesn't depend on the choice of the vector $p$.

From Remark 2.1(a)(i) it therefore follows that $K_0^2 \preceq K^2$ and hence (from Section 3)
that $K_0$ is fastest-mixing in several senses. Specifically:

(4.5) for all even $t$, the pmf $\pi_t$ of $X_t$ majorizes the pmf $\sigma_t$ of $U_t$

if $X$ and $U$ have respective kernels $K$ and $K_0$ and common non-increasing initial
pmf $\hat\pi$. Further, when we consider all symmetric birth-and-death chains started in
state 0, it follows from Corollary 3.3(c) that the chain with kernel $K_0$ is
fastest-mixing in $L^2$ (without the need to restrict to even times, nor to monotone chains).

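Remark 4.2. From the above calculations we see more generally that if $K$ and $\tilde K$
are two symmetric birth-and-death kernels and for every $i$ we have

\[ |p_i - \tfrac{1}{2}| \ge |\tilde p_i - \tfrac{1}{2}| \quad \text{and} \quad p_i p_{i+1} \le \tilde p_i \tilde p_{i+1}, \]

then $\tilde K^2 \preceq K^2$.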

4.3. Removal of parity restriction. Throughout this subsection all chains are
assumed to start at state 0, even when we do not explicitly declare so. The main
result of this section is the following theorem, which extends (4.5) to all times
$t = 0, 1, 2, \dots$ and therefore demonstrates (by Example 3.8) that the uniform chain
is fastest to mix in a variety of senses.

Theorem 4.3. Let $X$ be a birth-and-death chain with state space $\mathcal{X} = \{0, 1, \dots, n\}$
and symmetric kernel, and let $U$ be the uniform chain. Suppose that both chains
start at 0, and let $\pi_t$ (respectively, $\sigma_t$) denote the probability mass function of $X_t$
(respectively, $U_t$). Then

$\pi_t$ majorizes $\sigma_t$ for all $t$.
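
Theorem 4.3 lends itself to a brute-force sanity check: draw symmetric birth-and-death kernels at random, run them alongside the uniform chain from state 0, and test majorization at every time, odd times included. The sketch below is such a check and not part of any proof; the sampling scheme, the seed, and all function names are hypothetical choices made for illustration.

```python
from fractions import Fraction
import random

def sym_bd_kernel(p):
    """Symmetric birth-and-death kernel with K(i, i+1) = K(i+1, i) = p[i]."""
    n = len(p)
    K = [[Fraction(0)] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        K[i][i + 1] = K[i + 1][i] = Fraction(p[i])
    for i in range(n + 1):
        K[i][i] = 1 - sum(K[i])
    return K

def step(mu, K):
    """One step of the law: row vector mu times K."""
    m = len(mu)
    return [sum(mu[i] * K[i][j] for i in range(m)) for j in range(m)]

def majorizes(v, w):
    """Sorted partial sums of v dominate those of w, with equal totals."""
    sa = sb = Fraction(0)
    for x, y in zip(sorted(v, reverse=True), sorted(w, reverse=True)):
        sa += x
        sb += y
        if sa < sb:
            return False
    return sa == sb

def random_p(n, rng, denom=12):
    """Random p_0, ..., p_{n-1} in multiples of 1/denom with p_{i-1} + p_i <= 1."""
    nums, prev = [], 0
    for _ in range(n):
        prev = rng.randint(0, denom - prev)
        nums.append(prev)
    return [Fraction(k, denom) for k in nums]

def check_theorem(p, T):
    """Check that pi_t majorizes sigma_t for t = 0, ..., T, both chains from 0."""
    n = len(p)
    K, K0 = sym_bd_kernel(p), sym_bd_kernel([Fraction(1, 2)] * n)
    pi_t = sigma_t = [Fraction(1)] + [Fraction(0)] * n
    for _ in range(T + 1):
        if not majorizes(pi_t, sigma_t):
            return False
        pi_t, sigma_t = step(pi_t, K), step(sigma_t, K0)
    return True

rng = random.Random(0)
all_ok = all(check_theorem(random_p(5, rng), 12) for _ in range(50))
```

The sampled kernels need not be monotone, which is the point of the theorem: the conclusion holds without the monotonicity restriction of Section 4.1.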

Let $X$ have kernel $K$ as described at the outset of Section 4. Let $\Pi_t$ and $\Sigma_t$
denote the cumulative distribution functions (cdfs) corresponding to $\pi_t$ and $\sigma_t$,
respectively: for example,

\[ \Sigma_t(j) := \sum_{i=0}^{j} \sigma_t(i) = P(U_t \le j). \]

From Section 4.2 we already know that if $t$ is even then

\[ \Pi_t(i) \ge \Sigma_t(i) \quad \text{for all } i, \tag{4.6} \]

because then $\pi_t$ majorizes $\sigma_t$ and both pmfs are non-increasing.

We build to the proof of Theorem 4.3 by means of a sequence of lemmas. We
start with a few results about the uniform chain.
Lemma 4.4.

(a) For every time $t$, the pmf $\sigma_t$ is non-increasing on its domain $\{0, \dots, n\}$.

(b) The distribution "evolves by steps of two", depending on parity: for $i = 0, \dots, n - 1$ we have

$$\sigma_t(i) = \sigma_t(i + 1) \quad\text{if } t + i \text{ is odd.}$$

(c) For every time $t$, the cdf $\Sigma_t$ is concave (at integer arguments):

$$2\Sigma_t(i) \ge \Sigma_t(i + 1) + \Sigma_t(i - 1), \qquad i \ge 0. \tag{4.7}$$

(d) The inequality (4.7) is equality if $i \ge 0$ and $t$ and $i$ have opposite parity:

$$2\Sigma_t(i) = \Sigma_t(i + 1) + \Sigma_t(i - 1) \quad\text{if } i + t \text{ is odd.}$$

Proof, (a) This was proved in a more general setting just above p.2|) . 

(b) We use induction on $t$. The base case $t = 0$ is obvious ($0 = 0$).

Using the induction hypothesis at the second equality, we conclude, when $t$ and $i \ge 1$ have opposite parity, that

$$\sigma_t(i) = \tfrac12\,[\sigma_{t-1}(i - 1) + \sigma_{t-1}(i + 1)] = \tfrac12\,[\sigma_{t-1}(i) + \sigma_{t-1}(i + 2)] = \sigma_t(i + 1).$$

Similarly, when $t$ is odd we have

$$\sigma_t(0) = \tfrac12\,[\sigma_{t-1}(0) + \sigma_{t-1}(1)] = \tfrac12\,[\sigma_{t-1}(0) + \sigma_{t-1}(2)] = \sigma_t(1).$$

(c) We first remark that it is well known that (4.7) is indeed equivalent to concavity of $\Sigma_t$ at integer arguments. We then need only note that (4.7) is merely a rewriting of the monotonicity in part (a). Indeed,

$$2\Sigma_t(i) = \Sigma_t(i + 1) + \Sigma_t(i - 1) + \sigma_t(i) - \sigma_t(i + 1) \ge \Sigma_t(i + 1) + \Sigma_t(i - 1). \tag{4.8}$$

(d) Again using the equality at (4.8), this is merely a rewriting of the "steps of two" evolution in part (b). □
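The four parts of Lemma 4.4 can be checked numerically for small $n$. The sketch below is only an illustration, not part of the proof: it evolves the uniform chain from state 0 in exact rational arithmetic and tests (a) through (d) for $i = 0, \dots, n-1$, using the convention $\Sigma_t(-1) = 0$. The function names are hypothetical.

```python
from fractions import Fraction

def uniform_chain_pmfs(n, horizon):
    """Return [sigma_0, ..., sigma_horizon] for the uniform chain on {0..n} started at 0."""
    half = Fraction(1, 2)
    sigma = [Fraction(1)] + [Fraction(0)] * n
    history = [sigma]
    for _ in range(horizon):
        nxt = []
        for j in range(n + 1):
            left = sigma[j - 1] if j > 0 else sigma[0]    # holding probability 1/2 at 0
            right = sigma[j + 1] if j < n else sigma[n]   # holding probability 1/2 at n
            nxt.append(half * left + half * right)
        sigma = nxt
        history.append(sigma)
    return history

def check_lemma_4_4(n, horizon):
    """True if parts (a)-(d) hold for i = 0..n-1 at every time t <= horizon."""
    for t, sigma in enumerate(uniform_chain_pmfs(n, horizon)):
        cdf, running = [], Fraction(0)
        for mass in sigma:
            running += mass
            cdf.append(running)
        for i in range(n):
            odd = (t + i) % 2 == 1
            below = cdf[i - 1] if i > 0 else Fraction(0)
            gap = 2 * cdf[i] - cdf[i + 1] - below          # left side minus right side of (4.7)
            if sigma[i] < sigma[i + 1]:                    # (a) non-increasing pmf
                return False
            if odd and sigma[i] != sigma[i + 1]:           # (b) steps of two
                return False
            if gap < 0 or (odd and gap != 0):              # (c) concavity, (d) equality case
                return False
    return True
```

For instance, `check_lemma_4_4(5, 12)` examines every time up to 12 on the path $\{0, \dots, 5\}$.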

Lemma 4.5. For any time $t$ and any state $i$, if $\Pi_t(j) \ge \Sigma_t(j)$ for all states $j$ in $[i - 2, i + 2]$, then $\Pi_{t+2}(i) \ge \Sigma_{t+2}(i)$.

Proof. In the following calculations, we lean heavily on the fact that we are dealing with birth-and-death chains. Utilizing natural notation such as $K^2(h, \le i)$ for $\sum_{j \le i} K^2(h, j)$, we find using summation by parts that

$$\Pi_{t+2}(i) = \sum_{h=0}^{i+2} \pi_t(h)\, K^2(h, \le i) = \sum_{j=0}^{i+2} \Pi_t(j)\,[K^2(j, \le i) - K^2(j + 1, \le i)] = \sum_{j=i-2}^{i+2} \Pi_t(j)\,[K^2(j, \le i) - K^2(j + 1, \le i)].$$

Recalling that $K^2$ is monotone, the expression in square brackets here is nonnegative, so first by hypothesis and then by reversing the above steps (now with $\Sigma$ in place of $\Pi$) we have

$$\Pi_{t+2}(i) \ge \sum_{j=i-2}^{i+2} \Sigma_t(j)\,[K^2(j, \le i) - K^2(j + 1, \le i)] = \sum_{h=0}^{i+2} \sigma_t(h)\, K^2(h, \le i).$$

But $K_0^2 \preceq K^2$ (as noted in Section 4.2) and $\sigma_t$ is non-increasing [Lemma 4.4(a)], so we finally conclude

$$\Pi_{t+2}(i) \ge \sum_{h=0}^{i+2} \sigma_t(h)\, K_0^2(h, \le i) = \Sigma_{t+2}(i),$$

as desired. □

An immediate consequence is the following:

Lemma 4.6. If $p_0 \le 1/2$, then $\Pi_t(i) \ge \Sigma_t(i)$ for all times $t$ and all states $i$.

Proof. As previously discussed, we need only consider odd times, for which the proof is immediate by induction using Lemma 4.5 once the basis $t = 1$ is handled. But indeed

$$\Pi_1(0) = 1 - p_0 \ge \tfrac12 = \Sigma_1(0)$$

and $\Pi_1(i) = 1 = \Sigma_1(i)$ for $i \ge 1$. □

We can also prove that $\Pi_t(i) \ge \Sigma_t(i)$ for all $t$ if the transition probability from $i$ to $i + 1$ is sufficiently low:

Lemma 4.7. For any state $i$ such that $p_i \le 1/2$, we have $\Pi_t(i) \ge \Sigma_t(i)$ for all times $t$.

Proof. We begin with the observation that, by last-step analysis,

$$\Pi_t(i) = \Pi_{t-1}(i - 1) + \pi_{t-1}(i)(1 - p_i) + \pi_{t-1}(i + 1)\,p_i,$$

which can be rewritten in terms of cdfs as

$$\Pi_t(i) = p_i\,\Pi_{t-1}(i + 1) + (1 - 2p_i)\,\Pi_{t-1}(i) + p_i\,\Pi_{t-1}(i - 1)$$

in general and as

$$\Sigma_t(i) = \tfrac12\,\Sigma_{t-1}(i + 1) + \tfrac12\,\Sigma_{t-1}(i - 1)$$

for the uniform chain.

Again we need only prove the lemma for odd times $t$, and then we find

$$\Pi_t(i) = p_i\,\Pi_{t-1}(i + 1) + (1 - 2p_i)\,\Pi_{t-1}(i) + p_i\,\Pi_{t-1}(i - 1)$$
$$\ge p_i\,\Sigma_{t-1}(i + 1) + (1 - 2p_i)\,\Sigma_{t-1}(i) + p_i\,\Sigma_{t-1}(i - 1)$$
$$\ge \tfrac12\,\Sigma_{t-1}(i + 1) + \tfrac12\,\Sigma_{t-1}(i - 1) = \Sigma_t(i),$$

where we know the first inequality holds because $t - 1$ is even (whence $\Pi_{t-1}$ dominates $\Sigma_{t-1}$) and $p_i \le 1/2$, and the second inequality follows from concavity of $\Sigma_{t-1}$ [Lemma 4.4(c)] again using $p_i \le 1/2$. □



We can now combine Lemmas 4.5 and 4.7 to prove:

Lemma 4.8. If $p_i \le 1/2$ and $p_{i+1} \le 1/2$, then for all times $t$ we have

$$\Pi_t(j) \ge \Sigma_t(j) \quad\text{for all } j \ge i + 2. \tag{4.9}$$

Proof. We need only consider odd times, and we proceed by induction on $t$. For $t = 1$ we have $\Pi_1(j) = 1 = \Sigma_1(j)$ for all $j \ge 2$; so we move on to the induction step.

Suppose that (4.9) holds with $t$ replaced by $t - 2$. Use of Lemma 4.7 then ensures that we in fact have $\Pi_{t-2}(j) \ge \Sigma_{t-2}(j)$ for all $j \ge i$. Hence for any $j \ge i + 2$ we have $\Pi_{t-2}(\ell) \ge \Sigma_{t-2}(\ell)$ for all $\ell \in [j - 2, j + 2]$ and therefore, by Lemma 4.5,

$$\Pi_t(j) \ge \Sigma_t(j).$$ □

Lemma 4.9. If $t + i$ is even, then $\Pi_t(i) \ge \Sigma_t(i)$.

Proof. We may assume that $t$ and $i$ are odd. In light of Lemma 4.6 we may also assume $p_0 > 1/2$. Let $2\ell$ be the first state where the alternation of $p_i$'s greater than and no greater than 1/2 is broken:

$$p_{2\ell} \le \tfrac12, \qquad \forall\, 0 \le m < \ell:\ p_{2m} > \tfrac12 \ \text{and}\ p_{2m+1} \le \tfrac12. \tag{4.10}$$

(If there is no such break, we define $2\ell$ to be $n + 1$ or $n + 2$ according as $n$ is odd or even.) Notice that the break can happen only at an even state, since two consecutive $p_i$'s cannot both exceed 1/2.

Since $i$ is odd, we have either $i < 2\ell$ or $i > 2\ell$. In the former case, condition (4.10) implies $p_i \le 1/2$, and Lemma 4.7 proves that $\Pi_t(i) \ge \Sigma_t(i)$. In the latter case, we must have $2\ell \le n - 1$ in order for $i$ to be a state; we then observe that $p_{2\ell-1} \le 1/2$ and $p_{2\ell} \le 1/2$, and then $\Pi_t(i) \ge \Sigma_t(i)$ by Lemma 4.8. □

We are now prepared to complete the proof of Theorem 4.3.

Proof of Theorem 4.3. Because the cdf inequality (4.6) holds when either $t$ is even or (by Lemma 4.9) when $t + i$ is even, we need only establish the asserted majorization when $t$ is odd and $i$ is even. Indeed, in that case using Lemma 4.4(d) we have

$$\Sigma_t(i) = \tfrac12\,[\Sigma_t(i - 1) + \Sigma_t(i + 1)] \le \tfrac12\,[\Pi_t(i - 1) + \Pi_t(i + 1)] \le \Pi_t(i - 1) + \max\{\pi_t(i), \pi_t(i + 1)\},$$

and so there exist $i + 1$ entries of the vector $\pi_t$ whose sum is at least $\Sigma_t(i)$. We conclude that $\pi_t$ majorizes $\sigma_t$, as asserted. □
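As an illustration of Theorem 4.3 (not a substitute for the proof), the following sketch builds a symmetric birth-and-death kernel from a vector $p = (p_0, \dots, p_{n-1})$ with $p_{k-1} + p_k \le 1$, runs it alongside the uniform chain from state 0, and tests majorization at each time in exact arithmetic. All function names are hypothetical, and the particular vector $p$ in the usage note is an arbitrary choice.

```python
from fractions import Fraction

def symmetric_bd_kernel(p):
    """Kernel on {0..n} with K[i][i+1] = K[i+1][i] = p[i]; holding fills the diagonal."""
    n = len(p)
    K = [[Fraction(0)] * (n + 1) for _ in range(n + 1)]
    for i, prob in enumerate(p):
        K[i][i + 1] = K[i + 1][i] = Fraction(prob)
    for i in range(n + 1):
        K[i][i] = 1 - sum(K[i])
        if K[i][i] < 0:
            raise ValueError("need p[k-1] + p[k] <= 1 at every state")
    return K

def step(dist, K):
    """One step of the chain: row vector times kernel."""
    m = len(dist)
    return [sum(dist[h] * K[h][j] for h in range(m)) for j in range(m)]

def majorizes(a, b):
    """True if pmf a majorizes pmf b (partial sums of decreasing rearrangements)."""
    sum_a = sum_b = 0
    for x, y in zip(sorted(a, reverse=True), sorted(b, reverse=True)):
        sum_a += x
        sum_b += y
        if sum_a < sum_b:
            return False
    return True

def check_theorem_4_3(p, horizon):
    """True if pi_t majorizes sigma_t for t = 0..horizon, both chains started at 0."""
    n = len(p)
    K = symmetric_bd_kernel(p)
    K0 = symmetric_bd_kernel([Fraction(1, 2)] * n)
    pi_t = sigma_t = [Fraction(1)] + [Fraction(0)] * n
    for _ in range(horizon + 1):
        if not majorizes(pi_t, sigma_t):
            return False
        pi_t, sigma_t = step(pi_t, K), step(sigma_t, K0)
    return True
```

For example, `check_theorem_4_3([Fraction(7, 10), Fraction(1, 5), Fraction(3, 5), Fraction(3, 10)], 12)` exercises a non-monotone kernel (here $p_0 > 1/2$) at both even and odd times.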

Remark 4.10. (a) The multiset of values $\{P_i(U_t = j) : j \in \{0, \dots, n\}\}$ for the uniform chain $U$ started in state $i$ does not depend on $i \in \{0, \dots, n\}$; therefore, the uniform chain minimizes various distances from stationarity (including all those listed in Example 3.8) not only when the starting state is 0 but in the worst case over all starting states (and indeed over all starting distributions).

To see the asserted invariance in starting state, consider simple symmetric random walk $V$ on the cycle $\{0, \dots, 2n + 1\}$, with transition probability 1/2 in each direction between adjacent states (modulo $2n + 2$). Then for every $i, j \in \{0, \dots, n\}$ we have (by regarding states $n + 1, \dots, 2n + 1$ as "mirror reflections" of the states $n, \dots, 0$, respectively)

$$P_i(U_t = j) = P_i(V_t = j) + P_i(V_t = 2n + 1 - j),$$

where at most one of the two terms on the right (namely, the one with $j - i \equiv t$ modulo 2) is positive. Thus, as multisets of $2n + 2$ elements each, we have the






JAMES ALLEN FILL AND JONAS KAHN 



equality 

$$\{P_i(U_t = j) : j \in \{0, \dots, n\}\} \cup \{0, \dots, 0\} = \{P_i(V_t = j) : j \in \{0, \dots, 2n + 1\}\},$$

where the multiset $\{0, \dots, 0\}$ on the left here has (of course) $n + 1$ elements. Since the multiset on the right clearly does not depend on $i$, neither does $\{P_i(U_t = j) : j \in \{0, \dots, n\}\}$.
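The folding argument in part (a) can be checked directly for small cases. The sketch below is an illustration under the conventions stated above: it computes the law of the uniform chain on the path and of simple random walk on the cycle of length $2n + 2$ from each starting state $i$, verifies the displayed two-term identity, and compares the sorted values of $P_i(U_t = j)$ across starting states. The helper names are hypothetical.

```python
from fractions import Fraction

def walk_distribution(start, steps, size, neighbors):
    """Law after `steps` steps of a walk moving to each of two listed targets w.p. 1/2."""
    dist = [Fraction(0)] * size
    dist[start] = Fraction(1)
    for _ in range(steps):
        nxt = [Fraction(0)] * size
        for state, mass in enumerate(dist):
            for target in neighbors(state):
                nxt[target] += mass / 2
        dist = nxt
    return dist

def path_neighbors(n):
    # Uniform chain: a blocked move at an endpoint becomes holding there.
    return lambda i: (max(i - 1, 0), min(i + 1, n))

def cycle_neighbors(n):
    size = 2 * n + 2
    return lambda k: ((k - 1) % size, (k + 1) % size)

def check_remark_4_10(n, steps):
    """True if the folding identity holds and the multiset is the same for every start."""
    multisets = []
    for i in range(n + 1):
        u = walk_distribution(i, steps, n + 1, path_neighbors(n))
        v = walk_distribution(i, steps, 2 * n + 2, cycle_neighbors(n))
        if any(u[j] != v[j] + v[2 * n + 1 - j] for j in range(n + 1)):
            return False
        multisets.append(sorted(u))
    return all(m == multisets[0] for m in multisets)
```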

(b) The SLEM (second-largest eigenvalue in modulus) is an asymptotic measure (in the worst case over starting states) of distance from stationarity. Accordingly, by remark (a), the uniform chain minimizes SLEM among all symmetric birth-and-death chains. Thus we recover the main result of [5].

5. Fastest-mixing monotone birth-and-death chains

Let $n$ be a positive integer and consider the state space $\mathcal{X} = \{0, \dots, n\}$. Let $\pi$ be a log-concave distribution on $\mathcal{X}$, and consider the class of discrete-time monotone birth-and-death chains with state space $\mathcal{X}$ and stationary distribution $\pi$, started in state 0. In this section we identify the fastest-mixing stochastically monotone chain in this class as having kernel (call it $K_\pi$) with (death, hold, birth) probabilities $(q_i, r_i, p_i)$ given for $i \in \mathcal{X}$ by

$$q_i = \frac{\pi_{i-1}}{\pi_{i-1} + \pi_i}, \qquad r_i = \frac{\pi_i^2 - \pi_{i-1}\pi_{i+1}}{(\pi_{i-1} + \pi_i)(\pi_i + \pi_{i+1})}, \qquad p_i = \frac{\pi_{i+1}}{\pi_i + \pi_{i+1}}, \tag{5.1}$$

with $\pi_{-1} := 0$ and $\pi_{n+1} := 0$. In Section 5.1 we first find the FMMC when $\pi$ is held fixed; then in Section 5.2 we show that, when $\pi$ is allowed to vary, taking it to be uniform gives the slowest mixing in separation.

Throughout, we make heavy use of reversibility. Recall that any irreducible birth-and-death chain on $\mathcal{X}$ is reversible with respect to its unique stationary distribution $\pi$.

5.1. The FMMC when $\pi$ is fixed. The main result of this subsection is the following comparison inequality; and then Proposition 3.2 and Corollary 3.3 establish three senses (TV, separation, and $L^2$) in which the chain with kernel $K_\pi$ is fastest-mixing.

Theorem 5.1. Let $\pi$ be log-concave on $\mathcal{X} = \{0, \dots, n\}$. Let $K_\pi$ have (death, hold, birth) probabilities $(q_i, r_i, p_i)$ given by (5.1). Then $K_\pi$ is a monotone birth-and-death kernel with stationary distribution $\pi$, and $K_\pi \preceq K$ for any such kernel $K$.

Proof. Since for each $i$ the numbers $q_i$, $r_i$, $p_i$ are nonnegative ($r_i$ because of the log-concavity of $\pi$) and sum to unity, $K_\pi$ is indeed a birth-and-death kernel. Since $\pi_i p_i = \pi_{i+1} q_{i+1}$, it is reversible with stationary distribution $\pi$. Since $p_i + q_{i+1} = 1$, it satisfies the inequality (4.1) and so is monotone.

We now consider monotone birth-and-death kernels $K$ with stationary distribution $\pi$ and general $(q_i, r_i, p_i)$. We prove $K_\pi \preceq K$ by extending the calculations in Section 4 and in particular in Section 4.1. Note that if $f$ is the indicator of the down-set $\{0, 1, \dots, \ell\}$, then $Kf$ satisfies

$$(Kf)_j = \begin{cases} 1 & \text{if } 0 \le j < \ell \\ 1 - p_\ell & \text{if } j = \ell \\ q_{\ell+1} & \text{if } j = \ell + 1 \\ 0 & \text{otherwise;} \end{cases} \tag{5.2}$$

hence if $g$ is the indicator of the down-set $\{0, 1, \dots, m\}$, then

$$\langle Kf, g\rangle = \begin{cases} \sum_{j=0}^{m} \pi_j & \text{if } 0 \le m \le \ell - 1 \\ \sum_{j=0}^{\ell} \pi_j - \pi_\ell p_\ell & \text{if } m = \ell \\ \sum_{j=0}^{\ell} \pi_j & \text{if } \ell + 1 \le m \le n. \end{cases} \tag{5.3}$$

Monotonicity (4.1) requires precisely that for each $\ell = 0, \dots, n - 1$ we have

$$\left(1 + \frac{\pi_\ell}{\pi_{\ell+1}}\right) p_\ell = p_\ell + q_{\ell+1} \le 1;$$

so clearly $K_\pi \preceq K$. □
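The first paragraph of the proof amounts to three finite checks on the numbers defined by (5.1). A minimal sketch of those checks, assuming the pmf is supplied as a list of positive weights (the names `fmmc_rates` and `check_theorem_5_1_kernel` are hypothetical):

```python
from fractions import Fraction

def fmmc_rates(weights):
    """Return (pi, q, r, p) from (5.1) for the pmf proportional to `weights`."""
    total = sum(Fraction(w) for w in weights)
    pi = [Fraction(w) / total for w in weights]
    n = len(pi) - 1
    ext = [Fraction(0)] + pi + [Fraction(0)]   # ext[i + 1] = pi_i, with pi_{-1} = pi_{n+1} = 0
    q = [ext[i] / (ext[i] + ext[i + 1]) for i in range(n + 1)]
    p = [ext[i + 2] / (ext[i + 1] + ext[i + 2]) for i in range(n + 1)]
    r = [1 - q[i] - p[i] for i in range(n + 1)]
    return pi, q, r, p

def check_theorem_5_1_kernel(weights):
    """Nonnegativity, detailed balance, and p_i + q_{i+1} = 1 for the rates in (5.1)."""
    pi, q, r, p = fmmc_rates(weights)
    n = len(pi) - 1
    nonnegative = all(x >= 0 for x in q + r + p)
    reversible = all(pi[i] * p[i] == pi[i + 1] * q[i + 1] for i in range(n))
    tight = all(p[i] + q[i + 1] == 1 for i in range(n))
    return nonnegative and reversible and tight
```

With log-concave weights such as `[1, 2, 3, 3, 2]` the check passes; with weights such as `[2, 1, 2]`, which are not log-concave, the holding probability $r_1$ is negative and the check fails, which is where log-concavity enters.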

Remark 5.2. We see more generally that the kernels $K \in \mathcal{F}$ are non-increasing (in $\preceq$) in each $p_i$ and that $p_i = \pi_{i+1}/(\pi_i + \pi_{i+1})$ maximizes $p_i$ subject to the monotonicity constraint. (This remark generalizes Remark 4.1.) We observe in passing that the identity kernel $I$ is the top element (i.e., unique maximal element) in the restriction of the comparison-inequality partial order $\preceq$ to monotone birth-and-death chains.

Example 5.3. Suppose that the stationary pmf is proportional to $\pi_i = \rho^i$, i.e., is either truncated geometric (if $\rho < 1$) or its reverse (if $\rho > 1$) or uniform (if $\rho = 1$). Then the kernel $K_\pi$ corresponds to biased random walk:

$$q_i \equiv q := \frac{1}{1 + \rho}, \qquad r_i \equiv 0, \qquad p_i \equiv p := \frac{\rho}{1 + \rho}, \tag{5.4}$$

with the endpoint exceptions, of course, that $q_0 = 0$, $r_0 = q$, $r_n = p$, $p_n = 0$.

5.2. Slowest FMMC: the uniform chain. In this subsection we consider the monotone FMMCs given by (5.1) for log-concave pmfs $\pi$ and show (Theorem 5.9) that the uniquely slowest to mix in separation (at every time $t$) is obtained by setting $\pi$ = uniform. Our first two results of this subsection consider ergodic birth-and-death chains and their so-called strong stationary duals and do not need any assumption about log-concavity of $\pi$. By "ergodic" we mean that the chain is assumed to be aperiodic, irreducible, and positive recurrent (the third of which follows automatically from the first two since our state space is finite) and so settles down to its unique stationary distribution.

Proposition 5.4. Let $X$ be an ergodic monotone birth-and-death chain on $\mathcal{X} = \{0, \dots, n\}$ with stationary pmf $\pi$, (death, hold, birth) transition probabilities $(q_i, r_i, p_i)$ satisfying

$$q_{i+1} + p_i = 1 \qquad (i = 0, \dots, n - 1), \tag{5.5}$$

and initial state 0. Let $H$ denote the cdf corresponding to $\pi$, with $H_{-1} := 0$, and set

$$q_i^* = \frac{H_{i-1}}{H_i}\, p_i, \qquad r_i^* = 0, \qquad p_i^* = \frac{H_{i+1}}{H_i}\, q_{i+1} \qquad (i = 0, \dots, n - 1). \tag{5.6}$$

Then

$$\mathrm{sep}(t) = P(T > t) \qquad (t = 0, 1, \dots),$$

where the random variable $T$ is the hitting time of state $n$ for the birth-and-death chain $X^*$ with initial state 0 and transition probabilities (5.6).

Proof. The chain X* is called the strong stationary dual (SSD) of X, and the 
proposition is an immediate consequence of SSD theory [U Section 4.3]. □ 
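To make the proposition concrete, the following sketch compares the two sides of $\mathrm{sep}(t) = P(T > t)$ for the chain (5.1), which satisfies (5.5). It is an illustration only: separation is computed as $\max_j [1 - P(X_t = j)/\pi_j]$, the dual chain uses the rates (5.6) with state $n$ made absorbing, and the function name and the integer-weights input format are assumptions of this sketch.

```python
from fractions import Fraction

def separation_and_dual_tail(weights, horizon):
    """List of (sep(t), P(T > t)) for t = 0..horizon; `weights` are positive integers."""
    total = sum(weights)
    pi = [Fraction(w, total) for w in weights]
    n = len(pi) - 1
    # Primal rates from (5.1); p[i] + q[i + 1] = 1, so (5.5) holds.
    p = [pi[i + 1] / (pi[i] + pi[i + 1]) for i in range(n)] + [Fraction(0)]
    q = [Fraction(0)] + [1 - p[i] for i in range(n)]
    H, running = [], Fraction(0)
    for mass in pi:
        running += mass
        H.append(running)
    # Dual rates (5.6); H_{-1} = 0 gives q*_0 = 0.
    p_star = [H[i + 1] / H[i] * q[i + 1] for i in range(n)]
    q_star = [(H[i - 1] if i > 0 else Fraction(0)) / H[i] * p[i] for i in range(n)]

    x = [Fraction(1)] + [Fraction(0)] * n      # law of X_t
    d = [Fraction(1)] + [Fraction(0)] * n      # law of the dual, absorbed at n
    out = []
    for _ in range(horizon + 1):
        sep = max(1 - x[j] / pi[j] for j in range(n + 1))
        out.append((sep, 1 - d[n]))
        new_x = [Fraction(0)] * (n + 1)
        new_d = [Fraction(0)] * (n + 1)
        for j in range(n + 1):
            new_x[j] += x[j] * (1 - p[j] - q[j])
            if j < n:
                new_x[j + 1] += x[j] * p[j]
            if j > 0:
                new_x[j - 1] += x[j] * q[j]
        new_d[n] += d[n]
        for i in range(n):
            new_d[i + 1] += d[i] * p_star[i]
            if i > 0:
                new_d[i - 1] += d[i] * q_star[i]
        x, d = new_x, new_d
    return out
```

For $\pi \propto (1, 2, 1)$ a hand computation gives $\mathrm{sep}(2) = 1/9$, and the dual chain reaches state 2 by time 2 with probability $8/9$, in agreement with the proposition.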
Example 5.5. For a biased random walk as discussed in Example 5.3, the dual kernel is

$$q_i^* = \frac{\rho\,(1 - \rho^{i})}{(1 + \rho)(1 - \rho^{i+1})}, \qquad r_i^* = 0, \qquad p_i^* = \frac{1 - \rho^{i+2}}{(1 + \rho)(1 - \rho^{i+1})}$$

(interpreted by continuity when $\rho = 1$). It is easy to check that we obtain the same dual kernel for ratio $\rho^{-1}$ as for $\rho$. Thus if $q$ and $p$ are interchanged in a biased random walk with no holding except at the endpoints, then the two chains mix equally quickly in separation.

This can be seen another way: More generally, if the state space is a partially ordered set possessing both bottom ($\hat{0}$) and top ($\hat{1}$) elements, then for any ergodic kernel $K$ such that both $K$ and the time-reversal $\widetilde{K}$ are stochastically monotone, the chain $K$ from $\hat{0}$ and the chain $\widetilde{K}$ from $\hat{1}$ mix equally quickly in separation. Indeed, it is easy to see that for every $t$ we have, in obvious notation,

$$\mathrm{sep}_{\hat{0}}(t) = 1 - \frac{K^t(\hat{0}, \hat{1})}{\pi(\hat{1})} = 1 - \frac{\widetilde{K}^t(\hat{1}, \hat{0})}{\pi(\hat{0})} = \widetilde{\mathrm{sep}}_{\hat{1}}(t).$$

Lemma 5.6. Let $Y$ and $Z$ be two ergodic monotone birth-and-death chains on $\mathcal{X} = \{0, \dots, n\}$, with respective kernels $K$ and $L$, both started at 0, with possibly different stationary distributions. Suppose that $K(i + 1, i) + K(i, i + 1) = 1 = L(i + 1, i) + L(i, i + 1)$. Consider the notation of (5.6) and suppose also that $p_i^*$ arising from $Y$ is at least $p_i^*$ arising from $Z$ for all $i = 0, \dots, n$. Then $Y$ mixes faster in separation² than does $Z$.

Proof. Let $Y^*$ and $Z^*$ be the corresponding SSDs, as in Proposition 5.4. An obvious coupling gives $Y_t^* \ge Z_t^*$ for every $t$, and the lemma follows. It is worth pointing out that while the dual chains may not be monotone, this causes no problem with the coupling because $Y_t^*$ and $Z_t^*$ must have the same parity for every $t$; that's because the holding probabilities for both dual chains all vanish. □

Next, given a FMMC for log-concave $\pi$, we show that it mixes faster in separation than does a certain biased random walk.

Theorem 5.7. Consider the fastest-mixing monotone birth-and-death chain $X$ with log-concave stationary pmf $\pi$, kernel (5.1), and initial state 0. Define

$$\rho_i := \pi_{i+1}/\pi_i \qquad (i = 0, \dots, n - 1),$$

and suppose that $i = i_0$ minimizes $|\ln \rho_i|$. Then $X$ mixes faster in separation than does the biased random walk (5.4) with $\rho$ set to $\rho_{i_0}$.

Proof. Log-concavity is precisely the condition that $\rho_k$ is non-increasing in $k$. Hence $p_i^*$ satisfies

$$p_i^* = \frac{H_{i+1}}{H_i} \cdot \frac{\pi_i}{\pi_i + \pi_{i+1}} = \left(1 + \frac{\pi_{i+1}}{H_i}\right) \times \frac{1}{1 + \rho_i}$$
$$\ge \left(1 + \frac{\rho_i^{\,i+1}}{\sum_{k=0}^{i} \rho_i^{\,k}}\right) \times \frac{1}{1 + \rho_i} = f_i(\rho_i), \tag{5.7}$$

where the function

$$f_i(\rho) := \frac{1 - \rho^{i+2}}{(1 - \rho^{i+1})(1 + \rho)}, \qquad\text{with}\qquad f_i(1) := \frac{i + 2}{2(i + 1)}, \tag{5.8}$$

satisfies $f_i(\rho^{-1}) = f_i(\rho)$ and can be shown by induction on $i$ to be non-increasing in $\rho \le 1$ (and strictly so for $i \ge 1$). The induction step uses the fact that

$$f_i(\rho) = 1 - \frac{\rho}{(1 + \rho)^2 f_{i-1}(\rho)}$$

together with the induction hypothesis and the (strict) increasingness of the function $\rho \mapsto \rho/(1 + \rho)^2$ for $\rho \le 1$. Therefore

$$p_i^* \ge f_i(\rho_{i_0}),$$

and this last expression is the dual birth probability from state $i$ for the biased random walk with ratio $\rho_{i_0}$. The conclusion of the theorem now follows from Lemma 5.6. □

² Recall our terminological convention stated in the paragraph preceding Corollary 3.3.
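The three properties of $f_i$ used in this proof (symmetry under $\rho \mapsto \rho^{-1}$, the one-step recurrence, and monotonicity on $\rho \le 1$) are easy to spot-check in exact arithmetic. The sketch below is illustrative only; it writes $f_i$ as a ratio of geometric sums, which is algebraically the same as (5.8) and avoids a special case at $\rho = 1$. The function name `f` is chosen here for convenience.

```python
from fractions import Fraction

def f(i, rho):
    """f_i(rho) from (5.8), written with geometric sums so rho = 1 needs no special case."""
    rho = Fraction(rho)
    upper = sum(rho ** k for k in range(i + 2))   # 1 + rho + ... + rho^(i+1)
    lower = sum(rho ** k for k in range(i + 1))   # 1 + rho + ... + rho^i
    return upper / ((1 + rho) * lower)

def recurrence_holds(i, rho):
    """Check f_i = 1 - rho / ((1 + rho)^2 f_{i-1}) at a single point (i >= 1)."""
    rho = Fraction(rho)
    return f(i, rho) == 1 - rho / ((1 + rho) ** 2 * f(i - 1, rho))

def non_increasing_on_grid(i, grid):
    """Check f_i(a) >= f_i(b) for consecutive grid points a < b <= 1."""
    return all(f(i, a) >= f(i, b) for a, b in zip(grid, grid[1:]))
```

A grid check is of course not a proof of monotonicity; it only guards against a transcription error in the formula.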



So the question as to which of the FMMCs (5.1) is slowest to mix is reduced to finding the slowest biased random walk. But we've already done the calculations needed to prove the following result:

Theorem 5.8. Consider biased random walks as in Example 5.3, each with initial state 0. The walks are monotonically slower to mix in separation as $\min\{p/q, q/p\}$ increases.

Proof. We have already noted at Example 5.5 that the speed of mixing is invariant under interchange of $p$ and $q$. Moreover, as $\rho = p/q$ increases over $(0, 1]$, the chains are monotonically slower to mix in separation because we have equality in (5.7) and hence

$$p_i^* = f_i(\rho),$$

which (as shown in the proof of Theorem 5.7) is non-increasing in $\rho \le 1$. □

The next theorem is the main result of the subsection and is an immediate corollary of Theorems 5.7 and 5.8.

Theorem 5.9. Among the fastest-mixing monotone birth-and-death chains (5.1) with initial state 0 and log-concave stationary pmf $\pi$, the uniform chain is slowest to mix in separation.

Remark 5.10. How fast does an ergodic monotone birth-and-death chain mix in separation? We have addressed this question in general in Proposition 5.4 and in the last sentence of Example 5.5. The biased random walk (5.4) is treated in some
detail in [TTl Section XVI.3]. We note: 

(a) The eigenvalues, listed in decreasing order, are 1 and

$$2\sqrt{pq}\,\cos\frac{\pi j}{n + 1} \qquad (j = 1, \dots, n).$$

(b) Fix p and consider ?i — > oo. Let p = |p — g| denote the size of the drift of 
the walk. If p 7^ (i.e., p 7^ 1), there is a "cutoff phenomenon" for separation at 
time t — fm + CpU^^'^ . This means (roughly put) that separation is small at that 
time t when Cp is near ~oo and large when it is near -f cxj, with the subscript in Cp 
indicating that the definition of "near" depends on p. 

(c) If $\rho = 1$ (the uniform chain), it takes time of the larger order $n^2$ for separation to drop from near 1 to near 0, and in this case there is no cutoff phenomenon.
6. Lovász–Winkler mixing times

In previous sections we have discussed mixing in terms of TV, separation, $L^2$, and other functions measuring discrepancy. An alternative description of speed of convergence is provided by mixing times as defined by Lovász and Winkler [20]; according to their definition (reviewed below), and unlike for our previous notions of mixing, one number ["the mixing time", $T_{\mathrm{mix}}(X)$] is assigned to each chain $X$.

In this section we compute $T_{\mathrm{mix}}(X)$ for any irreducible birth-and-death chain $X$ started at 0 and then revisit the FMMC problems of the preceding two sections using $T_{\mathrm{mix}}$ as our criterion. One highlight is this: For the path-problem on $\mathcal{X} = \{0, \dots, n\}$, we show that the uniform chain is the fastest-mixing symmetric birth-and-death chain in the sense of Lovász and Winkler [20] if and only if $n$ is even, and we identify the fastest chain when $n$ is odd.

According to the definition in [20], the mixing time for any irreducible (discrete-time) finite-state Markov chain $X$ having stationary distribution $\pi$ is the (attained) infimum of expectations of randomized stopping times for which $\pi$ is the distribution of the stopping state. In symbols,

$$T_{\mathrm{mix}}(X) := \min E\,S \tag{6.1}$$

where the infimum is taken over randomized stopping times $S$ such that the distribution of $X_S$ is $\pi$. For computing $T_{\mathrm{mix}}(X)$, a very useful theorem from [20] asserts that a randomized stopping time $S$ achieves the minimum in (6.1) if and only if it has a halting state, that is, a state $x$ such that if $X_t = x$ then (almost surely) $S \le t$. We will use this result to compute $T_{\mathrm{mix}}(X)$ for any irreducible birth-and-death chain in Theorem 6.2, but first we state a lemma about expected hitting times for birth-and-death chains.

Lemma 6.1. For an irreducible birth-and-death chain on $\mathcal{X} = \{0, \dots, n\}$ (in discrete or continuous time) with stationary distribution $\pi$ and initial state 0, let $T$ denote the hitting time of state $n$.

(a) In discrete time, denote the birth probability from state $i$ by $p_i$. Then

$$E\,T = \sum_{i=0}^{n-1} \frac{1}{\pi_i p_i} \sum_{k=0}^{i} \pi_k.$$

(b) In continuous time, denote the birth rate from state $i$ by $\lambda_i$. Then

$$E\,T = \sum_{i=0}^{n-1} \frac{1}{\pi_i \lambda_i} \sum_{k=0}^{i} \pi_k.$$

Proof. Each assertion is easily established, and each follows immediately from the 
other; for (b), see, e.g., [HI Chapter 4, Problem 22]. □ 
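Part (a) can be cross-checked against an independent first-step computation. In the sketch below (an illustration with hypothetical function names), `p[i]` is the birth probability $p_i$ for $i = 0, \dots, n-1$ and `q[i]` is the death probability $q_{i+1}$ from state $i + 1$; the stationary pmf is recovered from detailed balance, and the expected time $h_i$ to move from $i$ to $i + 1$ satisfies $h_0 = 1/p_0$ and $h_i = (1 + q_i h_{i-1})/p_i$.

```python
from fractions import Fraction

def stationary_pmf(p, q):
    """Stationary pmf via detailed balance pi_i p_i = pi_{i+1} q_{i+1}."""
    weights = [Fraction(1)]
    for birth, death in zip(p, q):
        weights.append(weights[-1] * birth / death)
    total = sum(weights)
    return [w / total for w in weights]

def hitting_time_formula(p, q):
    """E T from Lemma 6.1(a): sum over i < n of H_i / (pi_i p_i)."""
    pi = stationary_pmf(p, q)
    total, cdf = Fraction(0), Fraction(0)
    for i, birth in enumerate(p):
        cdf += pi[i]
        total += cdf / (pi[i] * birth)
    return total

def hitting_time_first_step(p, q):
    """E T by first-step analysis, without reference to the stationary pmf."""
    total, h = Fraction(0), Fraction(0)
    for i, birth in enumerate(p):
        death = q[i - 1] if i > 0 else Fraction(0)   # q_i is stored at q[i - 1]
        h = (1 + death * h) / birth
        total += h
    return total
```

The two functions agree term by term, since $h_i = H_i/(\pi_i p_i)$ follows by induction from $\pi_{i-1} p_{i-1} = \pi_i q_i$.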

Theorem 6.2. Let $X$ be an irreducible (discrete-time) birth-and-death chain on $\mathcal{X} = \{0, \dots, n\}$ with stationary pmf (respectively, cdf) $\pi$ (resp., $H$) and initial state 0. Then

$$T_{\mathrm{mix}}(X) = \sum_{i=0}^{n-1} \frac{H_i (1 - H_i)}{\pi_i p_i}.$$

Proof. Let us use the naive rule $S$ as our randomized stopping time: Choose $j$ randomly according to $\pi$, and then let $S$ be the hitting time of $j$. Obviously the stopping distribution is $\pi$, as required. Moreover, the state $j$ must be hit en route to $n$; hence $n$ is a halting state and $S$ achieves the minimum at (6.1).

To compute $T_{\mathrm{mix}}(X) = E\,S$, we first note that Lemma 6.1(a) yields (easily) corresponding formulas for the expected value of the hitting time $T_j$ of each state $j$:

$$E\,T_j = \sum_{i=0}^{j-1} \frac{H_i}{\pi_i p_i}.$$

Therefore

$$T_{\mathrm{mix}}(X) = \sum_{j=0}^{n} \pi_j\, E\,T_j = \sum_{i=0}^{n-1} \frac{H_i}{\pi_i p_i} \sum_{j=i+1}^{n} \pi_j = \sum_{i=0}^{n-1} \frac{H_i (1 - H_i)}{\pi_i p_i},$$

as desired. □
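As a worked instance of Theorem 6.2, the sketch below evaluates the closed formula and, separately, the expectation $\sum_j \pi_j\, E\,T_j$ of the naive rule used in the proof. It is only an illustration; the function names are hypothetical, `pi` lists $\pi_0, \dots, \pi_n$, and `p` lists the birth probabilities $p_0, \dots, p_{n-1}$.

```python
from fractions import Fraction

def t_mix(pi, p):
    """Theorem 6.2: sum over i < n of H_i (1 - H_i) / (pi_i p_i)."""
    total, cdf = Fraction(0), Fraction(0)
    for mass, birth in zip(pi, p):       # p has length n, so this stops at i = n - 1
        cdf += mass
        total += cdf * (1 - cdf) / (mass * birth)
    return total

def t_mix_naive_rule(pi, p):
    """E S = sum_j pi_j E T_j, with E T_j = sum_{i<j} H_i / (pi_i p_i)."""
    expected, hit, cdf = Fraction(0), Fraction(0), Fraction(0)
    for j, mass in enumerate(pi):
        expected += mass * hit           # at this point hit equals E T_j
        if j < len(p):
            cdf += mass
            hit += cdf / (mass * p[j])
    return expected
```

For the uniform chain ($\pi_i = 1/(n+1)$, $p_i = 1/2$) the formula reduces to $\frac{2}{n+1}\sum_{k=0}^{n-1}(k+1)(n-k) = n(n+2)/3$, so for $n = 4$ the value is 8.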

Remark 6.3. (a) The Lovász–Winkler theory of mixing times and the statement and proof of Theorem 6.2 all carry over routinely to the "continuized" chain which evolves in the same way as the given discrete-time chain but with independent exponential random times with mean 1 replacing unit times. In particular, the value of $T_{\mathrm{mix}}(X)$ remains unchanged under continuization of an irreducible discrete-time birth-and-death chain $X$ with initial state 0.

(b) By a theorem of Aldous and Diaconis [2, Proposition 3.2] in discrete time
and a theorem of Fill Theorem 1.1] in continuous time, any ergodic finite-state 
Markov chain X (regardless of initial distribution) has a fastest (i.e., stochastically 
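minimal) strong stationary time $T$ satisfying $P(T > t) = \mathrm{sep}(t)$ for every $t$ (restricted to integer values for a discrete-time chain). If the state space is partially ordered with bottom element $\hat{0}$ and top element $\hat{1}$ and the chain $X$ starts in $\hat{0}$, and if the time-reversed kernel $\widetilde{K}$ is monotone, then $\hat{1}$ is a halting state for any such $T$; to see this, observe that

$$P(X_t = \hat{1}, T > t) = P(X_t = \hat{1}) - P(T \le t, X_t = \hat{1}) = P(X_t = \hat{1}) - \pi(\hat{1})\,(1 - \mathrm{sep}(t))$$
$$= P(X_t = \hat{1}) - \pi(\hat{1}) \min_i \frac{P(X_t = i)}{\pi_i} = P(X_t = \hat{1}) - \pi(\hat{1})\,\frac{P(X_t = \hat{1})}{\pi(\hat{1})} = 0,$$

where $\pi$ is the stationary distribution and the penultimate equality follows from the monotonicity of $\widetilde{K}$.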

Now consider an ergodic birth-and-death chain $X$ (in discrete or continuous time) on $\mathcal{X} = \{0, \dots, n\}$ with stationary distribution $\pi$ and initial state 0. In the discrete-time case, assume that the chain is monotone; this is automatic in continuous time by a simple and standard coupling argument. Then a fastest (i.e., stochastically minimal) strong stationary time $T$ exists, and $n$ is a halting state for any such $T$. It follows that $T_{\mathrm{mix}}(X) = E\,T$ and thus Theorem 6.2 also gives an expression for $E\,T$, which equals

$$\sum_{t=0}^{\infty} P(T > t) = \sum_{t=0}^{\infty} \mathrm{sep}(t)$$

in discrete time and equals

$$\int_0^{\infty} P(T > t)\,dt = \int_0^{\infty} \mathrm{sep}(t)\,dt$$

in continuous time. This remark gives added import to the value of $T_{\mathrm{mix}}(X)$ for any irreducible discrete-time birth-and-death chain $X$ (whether monotone or not) with initial state 0: It equals the integral of separation for the continuized chain.

(c) Given a collection $\mathcal{C}$ of irreducible discrete-time birth-and-death chains $Y$ with initial state 0, suppose that $X \in \mathcal{C}$ satisfies $X = \arg\min_{Y \in \mathcal{C}} T_{\mathrm{mix}}(Y)$. In light of remark (b), one might wonder whether the continuized chain corresponding to $X$ minimizes $\mathrm{sep}(t)$ at every time $t$ over all continuizations of chains $Y \in \mathcal{C}$. Theorem 6.5(b) provides a counterexample. Indeed, it can be shown that if we compare the chain of the form (6.4) but with $\theta_n$ changed to $(n - 1)/(2n)$ with any other birth-and-death chain having initial state 0 and symmetric kernel $K$, then there exists $t_0 = t_0(K)$ such that continuized separation at time $t$ is strictly smaller for the former chain than for the latter for all $0 < t < t_0$.³ Likewise, in the "ladder game" discussed in Section 7 it is the uniform chain, not the chain discussed there, that is "best in separation for small $t$" in similar fashion.

We are now in position to determine, for given $\pi$, the birth-and-death chain $X$ that minimizes $T_{\mathrm{mix}}(X)$ among those having initial state 0, stationary distribution $\pi$, and no holding probability except at the endpoints of the state space. Unlike in Section 5, we do not need to restrict to monotone kernels; and rather than assuming that $\pi$ is log-concave, we assume instead that $\pi$ is non-decreasing. For the case that $\pi$ is uniform, we will give later an argument that removes the restriction about holding probabilities. [There are examples, such as $\pi = \frac{1}{15}(1, 2, 4, 4, 4)$, showing that the restriction cannot be removed in general.]

Theorem 6.4. Let $\mathcal{X} = \{0, \dots, n\}$. Among all irreducible birth-and-death chains $X$ having a given positive non-decreasing stationary pmf $\pi$, initial state 0, and no holding probability except at 0 and $n$, there is a unique chain $X_\pi$ minimizing $T_{\mathrm{mix}}(X)$. Moreover,

(a) Let $\alpha_i := \sum_{j=1}^{i} (-1)^{i-j} \pi_j$ for $i = 0, \dots, n - 1$. Define

$$f(w) := \sum_{i=0}^{n-1} \frac{H_i (1 - H_i)}{(-1)^i w + \alpha_i}.$$

Then there exists a unique $w_\pi$ minimizing $f(w)$ over $w \in [0, \pi_0]$, and $T_{\mathrm{mix}}(X_\pi) = f(w_\pi)$.

(b) The optimal chain $X_\pi$ has transition probabilities

$$q_i = \frac{(-1)^{i-1} w_\pi + \alpha_{i-1}}{\pi_i}, \qquad r_i = 0, \qquad p_i = \frac{(-1)^i w_\pi + \alpha_i}{\pi_i} \qquad (i = 0, \dots, n)$$

with the exceptions $q_0 = 0$, $r_0 = 1 - p_0$, $r_n = 1 - q_n$, and $p_n = 0$.



"^Indeed, if Y and Z are the discrete-time and continuized chain corresponding to K, then, 
with $\pi$ denoting the uniform pmf, as $t \to 0$ we find

1 _ sep^(t) = = ^_,t" P{Y„ = n) ^ = ^in + l)poPi • • + o(t"+^), 

vr„ n! TTn n\ 

and $p_0 p_1 \cdots p_{n-1}$ is uniquely maximized subject to $p_{k-1} + p_k \le 1$ for $k = 0, \dots, n - 1$ by choosing $p_k = (n + 1)/(2n)$ if $k$ is even and $p_k = (n - 1)/(2n)$ if $k$ is odd.
Proof. We begin by noting that birth-and-death kernels with stationary distribution $\pi$ (in complete generality, irrespective of holding probabilities or non-decreasingness of $\pi$) are in one-to-one correspondence with nonnegative sequences $w = (w_{-1}, w_0, \dots, w_n)$ satisfying $w_{-1} = 0 = w_n$ and

$$w_{i-1} + w_i \le \pi_i \qquad (i = 0, \dots, n), \tag{6.2}$$

the correspondence being $w_i = \pi_i p_i = \pi_{i+1} q_{i+1}$, $i = 0, \dots, n - 1$. The proof is easy, and the correspondence gives

$$r_i = 1 - q_i - p_i = 1 - \frac{w_{i-1} + w_i}{\pi_i} \qquad (i = 0, \dots, n)$$

for the holding probabilities. In this $w$-parameterization, Theorem 6.2 gives

$$T_{\mathrm{mix}} = \sum_{i=0}^{n-1} \frac{H_i (1 - H_i)}{w_i}. \tag{6.3}$$

The constraint $r_i = 0$ for $i = 1, \dots, n - 1$ is precisely the constraint that equality holds in (6.2) for $i = 1, \dots, n - 1$. Then we must have $w := w_0 \in [0, \pi_0]$ and

$$w_i = (-1)^i w + \alpha_i \qquad (i = 0, \dots, n - 1).$$

It follows from the assumption that $\pi$ is non-decreasing that these $w_i$'s are indeed all nonnegative [and all positive if $w \in (0, \pi_0)$]. This proves the theorem, because $f$ is continuous on $[0, \pi_0]$ and both finite and strictly convex⁴ on $(0, \pi_0)$. □

We now specialize to the case of uniform $\pi$, removing the restriction on holding from Theorem 6.4 and solving explicitly for the value $w_\pi$ in Theorem 6.4(a). We find it somewhat surprising that the chain minimizing $T_{\mathrm{mix}}$ is not the uniform chain whenever $n \ge 3$ is odd.

Theorem 6.5. Consider the problem of minimizing $T_{\mathrm{mix}}$ among all birth-and-death chains on $\mathcal{X} = \{0, \dots, n\}$ with initial state 0 and symmetric kernel.

(a) If $n \ge 2$ is even, then the uniform chain is the unique minimizing chain.

(b) If $n$ is odd, then

$$p_k = \begin{cases} 1 - \theta_n & \text{if } k \text{ is even} \\ \theta_n & \text{if } k \text{ is odd} \end{cases} \qquad (k = 0, \dots, n - 1) \tag{6.4}$$

gives the unique minimizing chain, where for any $m$ we define

$$\theta_{m-1} := \frac{1}{6}\left[\sqrt{(m^2 + 2)(m^2 - 4)} - (m^2 - 4)\right]. \tag{6.5}$$

We have written the formula for $\theta_{m-1}$ rather than that for $\theta_n$ because it is simpler to write.

Remark 6.6. Although the uniform chain is not optimal when $n$ is odd, it is nearly optimal, since $\theta_n$ has the asymptotics

$$\theta_n = \tfrac12 - \tfrac34\, n^{-2} + O(n^{-3}) \quad\text{as } n \to \infty$$

and the value of $T_{\mathrm{mix}}$ (recall Theorem 6.2) for $\theta = 1/2$ is $\tfrac13 n^2 + \tfrac23 n$, only slightly larger than the optimal value $\tfrac13 n^2 + \tfrac23 n - \tfrac34 n^{-2} + O(n^{-3})$.

⁴ In the general setting of (6.3), $T_{\mathrm{mix}}$ is a strictly convex function on a nonempty convex domain (an intersection of half-spaces) of arguments $w$ and so has a unique minimum. The optimal $w$ is on the boundary of the domain; more specifically, for every $i = 0, \dots, n - 1$, if the optimal $w$ does not lie on the hyperplane delimiting the $i$th half-space (6.2), then it lies on the $(i + 1)$st such hyperplane.



Proof of Theorem 6.5. Recall Theorem 6.2; thus the goal is to minimize

$$f(p) := \sum_{k=0}^{n-1} \frac{(k + 1)(n - k)}{p_k}$$

over vectors $p = (p_0, \dots, p_{n-1})$ that are nonnegative (we won't repeat this nonnegativity condition below) and satisfy

$$p_{k-1} + p_k \le 1 \quad\text{for } k = 0, \dots, n \tag{6.6}$$

where $p_{-1} = 0 = p_n$. The objective function $f(p)$ is strictly convex in $p$ (by strict convexity of $x \mapsto x^{-1}$). Hence there is a unique minimizer, and because $(p_{n-1}, \dots, p_0)$ is clearly a minimizer if $(p_0, \dots, p_{n-1})$ is, the unique minimizer is of the form

$$(p_0, \dots, p_{(n/2)-1}, p_{(n/2)-1}, \dots, p_0)$$

if $n$ is even and of the form

$$(p_0, \dots, p_{(n-3)/2}, p_{(n-1)/2}, p_{(n-3)/2}, \dots, p_0)$$

if $n$ is odd. We now break into the two cases.

(a) For $n$ even, we seek equivalently to minimize

$$f(p) = 2 \sum_{k=0}^{(n/2)-1} \frac{(k + 1)(n - k)}{p_k}$$

subject to

$$p_{k-1} + p_k \le 1 \quad\text{for } k = 0, \dots, n/2.$$

[Note that the last of these conditions is $p_{(n/2)-1} \le 1/2$.]

We claim (by induction on $K$) for $1 \le K \le (n/2) - 1$ that the minimizer of $\sum_{k=0}^{K} (k + 1)(n - k)/p_k$ subject to (nonnegativity and) $p_{k-1} + p_k \le 1$ for $k = 0, \dots, K$ and $p_K \le 1/2$ is $p_k \equiv 1/2$.

For the basis $K = 1$ of the induction, we seek to minimize

$$\frac{n}{p_0} + \frac{2(n - 1)}{p_1}$$

subject to $p_0 + p_1 \le 1$ and $p_1 \le 1/2$. Clearly we should take $p_0 = 1 - p_1$ (regardless of $p_1$), and then we need to minimize

$$\frac{n}{1 - p_1} + \frac{2(n - 1)}{p_1}$$

subject to $p_1 \le 1/2$. Because $2(n - 1) \ge n$ (i.e., $n \ge 2$), the minimizer is $p_1 = 1/2$ (and then $p_0 = 1/2$).

We now proceed to the induction step to move from $K - 1$ to $K$. To minimize, clearly we should take $p_K = \min\{1/2, 1 - p_{K-1}\}$. The remainder of the proof for $n$ even then breaks into two cases.

Case 1. If $p_{K-1} \ge 1/2$, then we take $p_K = 1 - p_{K-1}$ and our goal is to minimize

$$\sum_{k=0}^{K-2} \frac{(k + 1)(n - k)}{p_k} + \frac{K(n - (K - 1))}{p_{K-1}} + \frac{(K + 1)(n - K)}{1 - p_{K-1}}$$

subject to $p_{k-1} + p_k \le 1$ for $0 \le k \le K - 1$ and (because this is Case 1) $p_{K-1} \ge 1/2$. Because $(K + 1)(n - K) \ge K(n - (K - 1))$ and we have the restriction $p_{K-1} \ge 1/2$, we should set $p_{K-1}$ as small as possible, namely, $p_{K-1} = 1/2$, and then we seek to minimize

$$\sum_{k=0}^{K-2} \frac{(k + 1)(n - k)}{p_k} + \frac{K(n - (K - 1))}{1/2} + \frac{(K + 1)(n - K)}{1/2}$$

subject to $p_{k-1} + p_k \le 1$ for $0 \le k \le K - 1$ and $p_{K-1} = 1/2$. Clearly the minimum value here is at least as large as the minimum value if we relax the last constraint to $p_{K-1} \le 1/2$. But then by induction the minimum value is achieved by setting $p_k \equiv 1/2$. This completes the proof in Case 1.

Case 2. If $p_{K-1} < 1/2$, then we set $p_K = 1/2$ and the goal is to minimize

$$\sum_{k=0}^{K-1} \frac{(k + 1)(n - k)}{p_k} + \frac{(K + 1)(n - K)}{1/2}$$

subject to $p_{k-1} + p_k \le 1$ for $0 \le k < K$ and $p_{K-1} \le 1/2$. But then again by induction the minimum value is achieved by setting $p_k \equiv 1/2$. This completes the proof in Case 2, and thereby completes the proof of part (a).

(b) For $n$ odd, suppose without loss of generality that $n \ge 3$. We first prove that the optimum is again attained for a chain that satisfies equality in condition (6.6) at interior points $k$ of the state space:

$$p_{k-1} + p_k = 1 \quad\text{for } k = 1, \dots, n - 1. \tag{6.7}$$

Recall that the minimizing $p$ is unique and symmetric. Hence, considering the holding probability $r_k := 1 - p_{k-1} - p_k$ at state $k$, it suffices to show that there is an optimizing chain with $r_k = 0$ for $1 \le k \le (n - 1)/2$.

We proceed by contradiction. We show that there exists $p'$ satisfying (6.6) and $f(p') < f(p)$ in each of the following three cases which, allowing arbitrary $k \in \{1, \dots, (n - 1)/2\}$, exhaust all possibilities where $r_k > 0$ for some $1 \le k \le (n - 1)/2$:

(i) $r_k > 0$ and $r_{k-1} > 0$;

(ii) $r_k > 0$ and $r_{k-1} = 0$ and $p_k \ge 1/2$;

(iii) $p_k < 1/2$, and $k$ is the largest value $j$ in $\{1, \dots, (n - 1)/2\}$ such that $r_j > 0$.

In case (i), let

$$p'_{k-1} := p_{k-1} + \min\{r_{k-1}, r_k\}$$

and $p'_j := p_j$ otherwise.

In case (ii), first note that $k \ge 2$; indeed, were we to have $k = 1$, then (by our assumption) $r_0 = 0$ and so $p_0 = 1$; but then $p_1 = 0$, and such a $p$ clearly doesn't minimize $f(p)$. Next, because $p_k \ge 1/2$ we must have $p_{k-1} < 1/2$ (because $r_k > 0$) and thus $p_{k-2} > 1/2$ (because $r_{k-1} = 0$). We can then let

$$p'_{k-1} := p_{k-1} + \varepsilon, \qquad p'_{k-2} := p_{k-2} - \varepsilon$$

for suitably small $\varepsilon > 0$, and $p'_j := p_j$ otherwise. Since $k \le (n - 1)/2$, we know $k(n + 1 - k) > (k - 1)(n + 2 - k)$, so the derivative of $f(p)$ in the direction of the vector $\delta_{k-1} - \delta_{k-2}$ is negative and $f(p') < f(p)$.

In case (iii) we have $p_{k+2i} = p_k$ for $0 \le i \le \frac{n-1}{2} - k$, and $p_{k+2i-1} = 1 - p_k$
for $1 \le i \le \frac{n-1}{2} - k$. We form $p'$ by changing these values to $p'_{k+2i} := p_k + \epsilon$ and
$p'_{k+2i-1} := 1 - p_k - \epsilon$ for suitably small $\epsilon > 0$ and setting $p'_j := p_j$ otherwise. We see
that $f(p') < f(p)$ if the derivative with respect to $p_k$ of the following expression is
negative for all $p_k < 1/2$:

$$\frac{1}{p_k} \sum_{i=0}^{\frac{n-1}{2}-k} (k+2i+1)(n-k-2i) + \frac{1}{1-p_k} \sum_{i=1}^{\frac{n-1}{2}-k} (k+2i)(n+1-k-2i);$$

and that is true if (and only if) the first sum is at least as large as the second.
Indeed, the first sum is larger than the second:

$$\sum_{i=0}^{\frac{n-1}{2}-k} (k+2i+1)(n-k-2i) - \sum_{i=1}^{\frac{n-1}{2}-k} (k+2i)(n+1-k-2i)
= (k+1)(n-k) + \sum_{i=1}^{\frac{n-1}{2}-k} (n-2k-4i)
= k(n-k) + \tfrac{1}{2}(n+1) > 0.$$
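The comparison of the two sums in case (iii) can be checked by brute force. The following is a minimal sketch, assuming that both sums run up to $(n-1)/2 - k$ (the first starting at $i = 0$, the second at $i = 1$) and that the closed form of their difference is $k(n-k) + (n+1)/2$; the function names are hypothetical and chosen only for illustration.

```python
def case_iii_sums(n, k):
    """Return (first_sum, second_sum) for odd n and 1 <= k <= (n - 1) / 2.

    Assumption: both sums have upper limit m = (n - 1)/2 - k; the first
    starts at i = 0 and the second at i = 1.
    """
    m = (n - 1) // 2 - k
    first = sum((k + 2 * i + 1) * (n - k - 2 * i) for i in range(0, m + 1))
    second = sum((k + 2 * i) * (n + 1 - k - 2 * i) for i in range(1, m + 1))
    return first, second


def closed_form_difference(n, k):
    """Assumed closed form k(n - k) + (n + 1)/2 of the difference (n odd)."""
    return k * (n - k) + (n + 1) // 2


def check_case_iii(max_n=25):
    """Compare brute-force difference and closed form for all odd n <= max_n."""
    for n in range(3, max_n + 1, 2):
        for k in range(1, (n - 1) // 2 + 1):
            first, second = case_iii_sums(n, k)
            if first - second != closed_form_difference(n, k):
                return False
            if first - second <= 0:
                return False
    return True
```

Calling `check_case_iii()` confirms, for small odd $n$, that the first sum strictly exceeds the second, which is the inequality the argument needs.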

Since we have established constraint (6.7), every feasible vector $p$ is of the form

$$p_k = \begin{cases} 1 - \theta & \text{if $k$ is even} \\ \theta & \text{if $k$ is odd,} \end{cases}$$

so we need only verify that the choice $\theta = \theta_n$ as defined at (6.5) is optimal. Indeed,
writing $r = (n-1)/2$ we have

$$a_n := \sum_{0 \le j \le r} (2j+1)(n-2j) = \tfrac{1}{12}(n+1)(n^2+2n+3),$$

$$b_n := \sum_{1 \le j \le r} (2j)(n-2j+1) = \tfrac{1}{12}(n+1)(n-1)(n+3),$$

and then the optimal choice of $\theta$, minimizing $a_n/(1-\theta) + b_n/\theta$, is $\theta_n$ given by

$$\theta_n = \left[ 1 + \sqrt{a_n/b_n} \right]^{-1}.$$

After a little bit of computation, we find that $\theta_n$ is given in accordance with
equation (6.5). □
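As a numerical sanity check of the last step, the sketch below evaluates the sums $a_n$ and $b_n$ directly, compares them with the closed forms $\frac{1}{12}(n+1)(n^2+2n+3)$ and $\frac{1}{12}(n+1)(n-1)(n+3)$, and confirms that $\theta = [1 + \sqrt{a_n/b_n}]^{-1}$ is a local minimizer of $a_n/(1-\theta) + b_n/\theta$. It assumes these reconstructed formulas; the function names are illustrative only.

```python
import math


def a_n(n):
    """Sum over even states: sum_{0 <= j <= r} (2j + 1)(n - 2j), r = (n - 1)/2."""
    r = (n - 1) // 2
    return sum((2 * j + 1) * (n - 2 * j) for j in range(0, r + 1))


def b_n(n):
    """Sum over odd states: sum_{1 <= j <= r} (2j)(n - 2j + 1), r = (n - 1)/2."""
    r = (n - 1) // 2
    return sum((2 * j) * (n - 2 * j + 1) for j in range(1, r + 1))


def theta_n(n):
    """Candidate minimizer [1 + sqrt(a_n / b_n)]^(-1)."""
    return 1.0 / (1.0 + math.sqrt(a_n(n) / b_n(n)))


def objective(n, theta):
    """The function a_n / (1 - theta) + b_n / theta to be minimized."""
    return a_n(n) / (1.0 - theta) + b_n(n) / theta


def closed_forms_hold(n):
    """Check both closed forms (multiplied through by 12 to stay in integers)."""
    return (12 * a_n(n) == (n + 1) * (n * n + 2 * n + 3)
            and 12 * b_n(n) == (n + 1) * (n - 1) * (n + 3))
```

For example, `a_n(5)` and `b_n(5)` are 19 and 16, and perturbing `theta_n(5)` slightly in either direction increases `objective(5, theta)`.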

7. A "ladder" game 

In this section we discuss a simple "ladder" game, where the class of kernels
considered is a certain subclass of the symmetric birth-and-death kernels considered
in Section 4. Our treatment involves finding the kernel that minimizes the Lovász-
Winkler mixing time $T_{\rm mix}$. This particular kernel is not one that had previously
been considered as a candidate for "fastest".

Lange and Miller [19] discuss a "ladder" game and several contexts, including
an old Japanese scheme for choosing a spouse's Christmas gift from a list of desired
items, in which it arises. We refer the reader to [19] for details. A class of Markov
chains that arise in modeling the ladder game (see "Model One" in [19, Section 5])
have the permutation group on $\{0, \dots, n\}$ as state space and moves that transpose
items in adjacent positions; write $p_i$ for the probability that the positions chosen
are $i$ and $i + 1$, so that

(7.1) $p_0 + p_1 + \cdots + p_{n-1} = 1$.

We will refer to (7.1) as the "ladder condition". If we follow the movement of only
a single item (this is "Model Two: The path of a single marcher as a random walk
among the columns of the ladder" in [19, Section 7, esp. Figure 9]), then we have
precisely the class of symmetric birth-and-death kernels considered in our path-
problem of Section 4, but now subject to the ladder condition. From [19, Section 8:
How many rungs is enough?] we have the following quote (with notation adjusted
slightly to match that of Section 4):

We suspect (but have not shown) that for any n, the rate of 
convergence is maximized when rung placement is uniform. That 
is, the absolute value of the largest small eigenvalue is minimized 
when $p_i \equiv 1/n$ for $i = 0, 1, \dots, n-1$.

(Here "largest small eigenvalue" means the eigenvalue of the kernel with largest 
absolute value strictly less than 1 — what is called "SLEM" in [SlISlll].) The authors 
of [19] base their suspicion on calculations for $n = 2$, for which their conjecture is
indeed true. 

The corresponding continuous-time problem has been studied by Fiedler [12] and,
in a somewhat more general setting, by Sun et al. in [35, Example 5.2]. The result
is that, among all continuous-time symmetric birth-and-death chains on $\{0, \dots, n\}$,
started from 0, with birth rates $p_i$ satisfying the ladder condition (7.1), the one
which is fastest-mixing in the sense of minimizing relaxation time has $p_i$ propor-
tional to $(i+1)(n-i)$. It can be shown that these weights also uniquely minimize
SLEM in discrete time, so the conjecture in [19] is false for every $n \ge 3$.

One might now suspect that these parabolic weights provide a FMMC (subject
to the ladder condition) in a variety of senses, at least for chains (as henceforth
assumed) starting in state 0. However, working in discrete time, it is clear (a) from
reviewing the discussion in Section 4.3 that there is no bottom element with respect
to $\preceq$ for monotone chains satisfying the ladder condition and (b) from Remark 4.2
that there is no bottom element in $\preceq$ for squares of ladder-condition birth-and-death
kernels. Further, it can be shown, switching to continuous time to match the setting
of [32] and in order to bring standard techniques to bear (it is well known that all
birth-and-death chains in continuous time are monotone), that there is no ladder-
condition birth-and-death chain minimizing separation at every time. Theorem 7.1
implies that the integral of separation over all times is minimized by weights $p_i$
proportional to the square roots $\sqrt{(i+1)(n-i)}$ of the weights minimizing SLEM.

Theorem 7.1. For each discrete-time symmetric birth-and-death chain with state
space $\{0, \dots, n\}$, initial state 0, and birth probabilities $p = (p_i)$ satisfying the lad-
der condition (7.1), let $f(p)$ denote its Lovász-Winkler mixing time $T_{\rm mix}$. Then
the uniquely optimal (i.e., minimizing) choice of $p$ is to take $p_i$ proportional to
$\sqrt{(i+1)(n-i)}$.

Theorem 7.1 is an immediate consequence of the following corollary to the proof
of Theorem 6.4, taking $\pi$ to be uniform and $c$ to be $1/(n+1)$.



^At the end of their Section 8, the authors of [19] also wonder, based on results for $n = 2$,
whether it might be the case for all $n$ that, except for multiplicities, the eigenvalues are the same
for the permutation chain as for the single-marcher chain. This is seen to be false by the discussion
in [7, Section 1.4]. But the main theorem of [7] does establish that the second-largest eigenvalues
of the two chains agree.
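Corollary 7.2. Over all discrete-time birth-and-death chains on $\{0, \dots, n\}$ (started
at 0) with given stationary distribution $\pi$ (having cdf $H$) and

$$\sum_{k=0}^{n-1} \pi_k p_k = c \in (0, \min_i \pi_i],$$

the mixing time $T_{\rm mix}$ of the chain is minimized by the choice

$$q_k = \frac{c \sqrt{H_{k-1}(1-H_{k-1})}}{\pi_k \sum_{j=0}^{n-1} \sqrt{H_j(1-H_j)}}, \qquad
p_k = \frac{c \sqrt{H_k(1-H_k)}}{\pi_k \sum_{j=0}^{n-1} \sqrt{H_j(1-H_j)}},$$

and the minimized value is

$$\frac{1}{c} \left[ \sum_{k=0}^{n-1} \sqrt{H_k(1-H_k)} \right]^2.$$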

Proof. As demonstrated in the proof of Theorem 6.4, the goal is to minimize

$$T_{\rm mix} = \sum_{i=0}^{n-1} \frac{H_i(1-H_i)}{w_i}$$

over nonnegative sequences $(w_{-1}, w_0, \dots, w_n)$ satisfying $w_{-1} = 0 = w_n$ and

(7.2) $w_{i-1} + w_i \le \pi_i$ $(i = 0, \dots, n)$

and $\sum_{k=0}^{n-1} w_k = c$. Ignoring the constraint (7.2), the optimal choice of the weights
$w_i$ is clear, namely, $w_i = \pi_i p_i$ with $p_i$ as asserted in the statement of the corollary.
But then (7.2) is automatically satisfied because we assume $c \in (0, \min_i \pi_i]$. Eval-
uation of the objective function at the optimizing kernel gives the optimized value
of $T_{\rm mix}$. □

Remark 7.3. Let $n \to \infty$. For the optimal kernel of Theorem 7.1 we have $T_{\rm mix} \sim
\frac{\pi^2}{64} n^3$ (here $\pi$ is the circle constant), whereas for both $p_i \equiv 1/n$ (the guess for
optimality in [19]) and the choice $p_i \propto (i+1)(n-i)$ minimizing SLEM we have
$T_{\rm mix} = \frac{1}{6} n^2 (n+2) \sim \frac{1}{6} n^3$.
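The comparison in this remark can be reproduced numerically. The sketch below assumes that, for uniform stationary distribution on $\{0, \dots, n\}$, the Lovász-Winkler mixing time from state 0 is $T_{\rm mix} = \sum_{i=0}^{n-1} H_i(1-H_i)/(\pi_i p_i) = \sum_{i=0}^{n-1} (i+1)(n-i)/((n+1) p_i)$; this formula and all function names are assumptions made for illustration.

```python
import math


def t_mix(p):
    """Assumed mixing-time formula for birth probabilities p = (p_0, ..., p_{n-1}).

    Uses uniform stationary distribution on {0, ..., n}, so that
    H_i (1 - H_i) / pi_i = (i + 1)(n - i) / (n + 1).
    """
    n = len(p)
    return sum((i + 1) * (n - i) / ((n + 1) * p[i]) for i in range(n))


def normalized(weights):
    """Rescale nonnegative weights so they satisfy the ladder condition (sum 1)."""
    total = sum(weights)
    return [w / total for w in weights]


def ladder_choices(n):
    """Return the uniform, parabolic, and square-root-parabolic choices of p."""
    uniform = [1.0 / n] * n
    parabolic = normalized([(i + 1) * (n - i) for i in range(n)])
    sqrt_parabolic = normalized([math.sqrt((i + 1) * (n - i)) for i in range(n)])
    return uniform, parabolic, sqrt_parabolic
```

Under these assumptions, the uniform and parabolic choices give the same value of `t_mix`, and the square-root choice gives a strictly smaller one for every $n \ge 3$.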

8. Can extra updates delay mixing? 
(no, subject to positive correlations) 

Can extra updates delay mixing? This question is the title of a paper [23] by
Yuval Peres and Peter Winkler (see also Holroyd [17] for counterexamples). Peres
and Winkler show that the answer is no, for total variation distance, in the setting
of monotone spin systems, generalized by replacing the set of spins $\{0,1\}$ by any
linearly ordered set. (We review relevant terminology below.) In Theorem 8.3 we
recapture and extend their result using comparison inequalities by showing that
$K_v \preceq I$ for any kernel $K_v$ that updates a single site $v$, i.e., that the identity
kernel [as for the monotone birth-and-death example, see Remark 5.2(a)] only slows
mixing (when the initial pmf has non-increasing ratio with respect to the stationary
pmf). This is because then, noting reversibility and stochastic monotonicity of each $K_v$
and applying Proposition 2.4, for any $v_1, \dots, v_t$ the product $K_{v_1} \cdots K_{v_t}$ increases
in $\preceq$ by deletion of any $K_{v_i}$. The comparison inequality $K_v \preceq I$ holds in the
more general setting of a partially ordered set of "spins", subject to the following
restriction: Starting with distribution $\pi$ and a site $v$ and conditioning on the spins
at all sites other than $v$, the conditional law of the spin at $v$ should have positive
correlations (as, of course, does any distribution on a linearly ordered set).
8.1. Positive correlations. Recall that a pmf $\pi$ on a finite partially ordered set X
is said to have positive correlations if (in the notation of Section 2)

$$\langle f, g \rangle \ge \langle f, 1 \rangle \langle g, 1 \rangle$$

for every $f, g \in \mathcal{M}$, and that if X is linearly ordered then (by "Chebyshev's other
inequality"; see, e.g., [22, Lemma 16.2]) all probability measures have positive corre-
lations. The connection with comparison inequalities is the following simple lemma,
in relation to which we note that both $K_\pi$ and $I$ are stochastically monotone kernels
possessing stationary distribution $\pi$.

Lemma 8.1. A pmf $\pi$ on a finite partially ordered set X has positive correlations
if and only if $K_\pi \preceq I$, where $K_\pi$ is the trivial kernel that jumps in one step to $\pi$
and $I$ is the identity kernel.

Proof. Since for any $f$ and $g$ we have

$$\langle K_\pi f, g \rangle = \langle \langle f, 1 \rangle, g \rangle = \langle f, 1 \rangle \langle g, 1 \rangle$$

and $\langle I f, g \rangle = \langle f, g \rangle$, the lemma is proved. □
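For a linearly ordered state space the inequality $\langle f, g \rangle \ge \langle f, 1 \rangle \langle g, 1 \rangle$ can be spot-checked numerically. The sketch below assumes that $\langle f, g \rangle$ denotes $\sum_x \pi(x) f(x) g(x)$ and that the relevant functions are non-increasing on $\{0, \dots, m-1\}$; the helper names are hypothetical.

```python
import random


def inner(f, g, pi):
    """Assumed inner product <f, g> = sum_x pi(x) f(x) g(x)."""
    return sum(p * a * b for p, a, b in zip(pi, f, g))


def has_positive_correlation(f, g, pi, tol=1e-12):
    """Check <f, g> >= <f, 1><g, 1> up to a floating-point tolerance."""
    ones = [1.0] * len(pi)
    return inner(f, g, pi) >= inner(f, ones, pi) * inner(g, ones, pi) - tol


def random_nonincreasing(m, rng):
    """A random non-negative, non-increasing function on {0, ..., m - 1}."""
    return sorted((rng.random() for _ in range(m)), reverse=True)


def random_pmf(m, rng):
    """A random strictly positive pmf on {0, ..., m - 1}."""
    weights = [rng.random() + 1e-9 for _ in range(m)]
    total = sum(weights)
    return [w / total for w in weights]


def chebyshev_check(trials=200, m=6, seed=0):
    """Test the inequality on random instances; returns True if all pass."""
    rng = random.Random(seed)
    return all(
        has_positive_correlation(random_nonincreasing(m, rng),
                                 random_nonincreasing(m, rng),
                                 random_pmf(m, rng))
        for _ in range(trials)
    )
```

By the lemma, each passing instance is equivalently an instance of $\langle K_\pi f, g \rangle \le \langle I f, g \rangle$.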

Proposition 8.2. Let $\pi$ be a pmf on a finite partially ordered set X. Partition X,
suppose that a given kernel $K$ on X is a direct sum [as in Proposition 2.3(c)] of
trivial kernels $K_i$ (as in Lemma 8.1) on the cells of the partition, and suppose that $\pi$
conditioned to each cell has positive correlations. Then $K \preceq I$.

Proof. Simply combine Lemma 8.1 and Proposition 2.3(c). □

8.2. Monotone spin systems. Our setting is the following. We are given a finite
graph $G = (V, E)$ and a finite partially ordered set $S$ of "spin values". A spin config-
uration is an assignment of spins to vertices (sites), and our state space is the set X
of all configurations. We are given a pmf $\pi$ on X that is monotone in the sense that,
when we start with $\pi$ and any site $v$ and condition on the spins at all sites other
than $v$, the conditional law of the spin at $v$ is monotone in the conditioning spins.
We recover and (modestly) extend the Peres-Winkler result by means of the follow-
ing theorem, which (i) allows somewhat more general $S$ and (ii) encompasses (by
means of Proposition 3.2, Corollary 3.3(a)-(b), and Remark 3.4) separation and
$L^2$-distance as well as TV.

Theorem 8.3. Fix a site $v$, and suppose that the conditional distributions discussed
in the preceding paragraph all have positive correlations. Let $K_v$ be the (stochas-
tically monotone) Markov kernel for update at site $v$ according to the conditional
distributions discussed. Then we have the comparison inequality $K_v \preceq I$.

Proof. Say that two configurations are equivalent if they differ at most in their spin
at $v$, and let $[x]$ denote the equivalence class containing a given configuration $x$.
Then $K_v$ is given by

$$K_v(x, y) = \mathbf{1}(y \in [x]) \, \frac{\pi(y)}{\pi([x])}.$$

This $K_v$ is the direct sum of the trivial kernels (as in Lemma 8.1) on each equiv-
alence class. Further, each class is naturally isomorphic as a partially ordered set
to $S$ and so has positive correlations. It is well known and easily checked that $K_v$
is stochastically monotone, so the theorem is an immediate consequence of Propo-
sition 8.2. □
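To make the single-site update kernel concrete, here is a minimal sketch for a small spin system. It assumes the form $K_v(x, y) = \mathbf{1}(y \in [x])\,\pi(y)/\pi([x])$ and uses, purely as an illustrative example not taken from the text, a ferromagnetic Ising-type pmf on a path of sites with spins in $\{0, 1\}$; all names and the parameter `beta` are hypothetical.

```python
import itertools
import math


def ising_pmf(n_sites, beta):
    """Illustrative pmf on {0,1}^n_sites: weight exp(beta * #agreeing neighbors)."""
    states = list(itertools.product((0, 1), repeat=n_sites))
    weights = {
        x: math.exp(beta * sum(x[u] == x[u + 1] for u in range(n_sites - 1)))
        for x in states
    }
    total = sum(weights.values())
    return states, {x: w / total for x, w in weights.items()}


def site_update_kernel(states, pi, v):
    """K_v(x, y) = 1(y in [x]) * pi(y) / pi([x]); [x] = configs agreeing with x off v."""
    kernel = {}
    for x in states:
        cls = [y for y in states
               if all(y[u] == x[u] for u in range(len(x)) if u != v)]
        mass = sum(pi[y] for y in cls)
        for y in cls:
            kernel[(x, y)] = pi[y] / mass
    return kernel


def rows_sum_to_one(kernel, states):
    """Check that every row of the kernel is a probability distribution."""
    return all(
        math.isclose(sum(kernel.get((x, y), 0.0) for y in states), 1.0)
        for x in states
    )


def is_reversible(kernel, pi):
    """Check detailed balance pi(x) K(x, y) = pi(y) K(y, x)."""
    return all(
        math.isclose(pi[x] * k, pi[y] * kernel.get((y, x), 0.0))
        for (x, y), k in kernel.items()
    )
```

For instance, with `states, pi = ising_pmf(3, 0.7)` and `K = site_update_kernel(states, pi, 1)`, both `rows_sum_to_one(K, states)` and `is_reversible(K, pi)` hold, and each equivalence class has exactly two configurations.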
Remark 8.4. [random vs. systematic site updates] It follows [from Theo-
rem 8.3 and Proposition 2.3(b)] for monotone spin systems with (say) linearly or-
dered $S$ that, when the chains start from a common pmf having non-increasing ratio
relative to $\pi$, the "systematic site updates" chain with kernel $K_{\rm syst} := K_{v_1} \cdots K_{v_{|V|}}$
(for any ordering $v_1, \dots, v_{|V|}$ of the sites $v \in V$) mixes faster in TV, sep, and $L^2$
than does the "random site updates" chain with kernel $K_{\rm rand} := \sum_{v \in V} p_v K_v$ [for
any pmf $p = (p_v)_{v \in V}$ on $V$]. This is because (recalling the paragraph preceding
Proposition 2.3) the reversible kernel $K_{\rm rand}$ is stochastically monotone, as are $K_{\rm syst}$
and its time-reversal, and $K_{\rm syst} \preceq K_{\rm rand}$. [The explanation for the comparison here
is that (as noted in the first paragraph of this section) $K_{\rm syst} \preceq K_v$ for each $v \in V$
and (by Proposition 2.3(b)) the relation $\preceq$ on $\mathcal{K}$ is preserved under mixtures.] It is
important to keep in mind here that one "sweep" of the sites using $K_{\rm syst}$ is counted
as only one Markov-chain step.

There is a very weak ordering in the opposite direction: $K_{\rm rand}^{|V|} \preceq p K_{\rm syst} + (1-p) I$
with $p := \prod_{v \in V} p_v$.

8.3. Extra updates don't delay mixing: card-shuffling. The following card-
shuffling Markov chain, which has been studied quite a bit (see [3] and references
therein) in the time-homogeneous "random updates" case where update positions
are chosen independently and uniformly, is another example where comparison
inequalities can be used to show that extra updates do not delay mixing.

Our state space is the set X of all permutations of $\{1, \dots, n\}$, and there is a
parameter $p \in (0, 1)$. Given $i \in \{1, \dots, n-1\}$, we can update adjacent positions $i$
and $i + 1$ by sorting (i.e., putting into natural order) the two cards (numbers)
in those positions with probability $p$ and "anti-sorting" them with the remaining
probability. Call the update kernel $K_i$. It is straightforward to check that each $K_i$
is (i) reversible with respect to $\pi$, where ${\rm inv}(x)$ is the number of inversions in the
permutation $x$ and $\pi(x)$ is proportional to $[(1-p)/p]^{{\rm inv}(x)}$ [indeed, $K_i(x, \cdot)$ is the
law of a permutation drawn from $\pi$ but conditioned to agree with $x$ at all positions
other than $i$ and $i + 1$], and (ii) stochastically monotone with respect to the Bruhat
order on X (defined so that $x \le y$ if $y$ can be obtained from $x$ by a sequence of
anti-sorts of not necessarily adjacent cards).




Theorem 8.5. Fix a position $i \in \{1, \dots, n-1\}$, and let $K_i$ be the Markov kernel
for update of positions $i$ and $i + 1$ as discussed in the preceding paragraph. Then
we have the comparison inequality $K_i \preceq I$.

The proof of Theorem 8.5 is essentially the same as for Theorem 8.3 and therefore
is omitted. The key is that the relevant equivalence classes now consist of only two
permutations each and so are certainly linearly ordered, therefore having positive
correlations.
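The reversibility claim for the sort/anti-sort update can be verified exhaustively for small $n$. The following sketch builds $K_i$ on all permutations of $\{1, \dots, n\}$ and checks detailed balance against $\pi(x) \propto [(1-p)/p]^{{\rm inv}(x)}$; the function names are illustrative, and positions are 1-indexed as in the text.

```python
import itertools
import math


def inversions(x):
    """Number of pairs a < b with x[a] > x[b]."""
    return sum(1 for a in range(len(x)) for b in range(a + 1, len(x)) if x[a] > x[b])


def stationary_pmf(n, p):
    """pi(x) proportional to ((1 - p) / p) ** inv(x) on permutations of 1..n."""
    perms = list(itertools.permutations(range(1, n + 1)))
    weights = {x: ((1 - p) / p) ** inversions(x) for x in perms}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


def update_kernel(n, i, p):
    """K_i: sort positions i, i+1 (1-indexed) w.p. p, anti-sort w.p. 1 - p."""
    kernel = {}
    for x in itertools.permutations(range(1, n + 1)):
        lo, hi = sorted((x[i - 1], x[i]))
        sorted_x = x[:i - 1] + (lo, hi) + x[i + 1:]
        anti_x = x[:i - 1] + (hi, lo) + x[i + 1:]
        kernel[(x, sorted_x)] = p
        kernel[(x, anti_x)] = 1 - p
    return kernel


def is_reversible(kernel, pi):
    """Check detailed balance pi(x) K(x, y) = pi(y) K(y, x)."""
    return all(
        math.isclose(pi[x] * k, pi[y] * kernel.get((y, x), 0.0))
        for (x, y), k in kernel.items()
    )
```

Each equivalence class here is a pair of permutations differing by one adjacent transposition, so the number of inversions differs by exactly one within a class, which is why the weights $[(1-p)/p]^{{\rm inv}(x)}$ balance the probabilities $p$ and $1-p$.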

8.4. A final example. In a specific setting (linearly ordered state space and uni-
form stationary distribution) we have $K \preceq I$ quite generally:

Theorem 8.6. Let X be a linearly ordered state space. If $K$ is doubly stochastic,
then $K \preceq I$ (with respect to uniform $\pi$).

^To establish the monotonicity of $K_i$, it is sufficient to consider initial states $x$ and $y$ where $y$ is
obtained from $x$ by a single anti-sort of two not necessarily adjacent cards and couple transitions
from these states so that the corresponding terminal states, call them $X_1$ and $Y_1$, satisfy $X_1 \le Y_1$.
A coupling that one can check works (by considering various cases) is to make the same decision,
for $x$ and for $y$, to sort or to anti-sort the cards in positions $i$ and $i + 1$.




Remark 8.7. (a) When $\pi$ is uniform, to say that a kernel $K$ is doubly stochastic
is precisely to say that $\pi$ is stationary for $K$. If $K$ is symmetric, then Theorem 8.6
applies. Thus inserting a monotone symmetric kernel (or, more generally, a mono-
tone doubly stochastic kernel whose transpose is also monotone) in a list of such
kernels to be applied never slows mixing (by Proposition 2.4 or the more general
Corollary 2.8 and the results of Section 3) when the initial pmf is non-increasing.

(b) If "linearly ordered" is relaxed to "partially ordered" in Theorem 8.6, the
result is not generally true, even for monotone $K$. This follows from Lemma 8.1,
since there are partially ordered sets for which the uniform distribution does not
have positive correlations.

Proof of Theorem 8.6. We must show that $\langle Kf, g \rangle \le \langle f, g \rangle$ when $f$ and $g$ are non-
negative and belong to $\mathcal{M}$ (i.e., are non-increasing) and (without loss of general-
ity) $f$ sums to 1. It is a fundamental result in the theory of majorization [21] that $f$
majorizes $Kf$ if $K$ is doubly stochastic. Since X is linearly ordered and $f$ belongs
to $\mathcal{M}$, it follows that, regarded as pmfs, $f$ and $Kf$ satisfy $Kf \ge f$ stochastically.
Therefore, for $g \in \mathcal{M}$ we have $\langle Kf, g \rangle \le \langle f, g \rangle$, as desired. □
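The inequality $\langle Kf, g \rangle \le \langle f, g \rangle$ can be spot-checked on random instances. The sketch below assumes the inner product with respect to uniform $\pi$ on $\{0, \dots, m-1\}$, takes $(Kf)(x) = \sum_y K(x, y) f(y)$, and generates doubly stochastic matrices as random convex combinations of permutation matrices (a convenient construction for testing, not one used in the text); all names are hypothetical.

```python
import random


def random_doubly_stochastic(m, rng, n_terms=5):
    """Random convex combination of n_terms permutation matrices of size m."""
    weights = [rng.random() + 1e-9 for _ in range(n_terms)]
    total = sum(weights)
    K = [[0.0] * m for _ in range(m)]
    for w in weights:
        perm = list(range(m))
        rng.shuffle(perm)
        for x in range(m):
            K[x][perm[x]] += w / total
    return K


def apply_kernel(K, f):
    """(K f)(x) = sum_y K(x, y) f(y)."""
    m = len(f)
    return [sum(K[x][y] * f[y] for y in range(m)) for x in range(m)]


def inner_uniform(f, g):
    """Inner product <f, g> with respect to the uniform pmf."""
    return sum(a * b for a, b in zip(f, g)) / len(f)


def random_nonincreasing(m, rng):
    """A random non-negative, non-increasing function on {0, ..., m - 1}."""
    return sorted((rng.random() for _ in range(m)), reverse=True)


def check_theorem(trials=200, m=6, seed=0, tol=1e-12):
    """Return True if <Kf, g> <= <f, g> on every random instance."""
    rng = random.Random(seed)
    for _ in range(trials):
        K = random_doubly_stochastic(m, rng)
        f = random_nonincreasing(m, rng)
        g = random_nonincreasing(m, rng)
        if inner_uniform(apply_kernel(K, f), g) > inner_uniform(f, g) + tol:
            return False
    return True
```

For a single permutation matrix the inequality reduces to the rearrangement inequality for two similarly ordered sequences, and convex combinations preserve it.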